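I first ran into stack backtracing while chasing problems across different business scenarios. Usually the other party hands you nothing but an interface, and to work out the call direction in a scenario you have to ask different people and try different approaches. I wanted a method that would speed up reading and understanding tangled business code.
More recently I have come to feel that the helplessness people show when facing a crash is exactly what signal capture combined with stack backtracing should address.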
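As a rough illustration of combining signal capture with a backtrace, here is a minimal sketch. It assumes gcc's <unwind.h> and that unwind tables exist for every frame being walked. MAX_FRAMES, frame_buf, collect_frames, write_hex, crash_handler and install_crash_handler are hypothetical names, not taken from the original tool. Note that _Unwind_Backtrace is not formally async-signal-safe, so calling it from a handler is a pragmatic choice rather than a guaranteed one.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <unwind.h>

#define MAX_FRAMES 16   /* hypothetical depth limit */

struct frame_buf { void *pc[MAX_FRAMES]; int count; };

/* Called once per frame by the unwinder; stores the frame's PC. */
static _Unwind_Reason_Code collect_cb(struct _Unwind_Context *ctx, void *arg)
{
    struct frame_buf *fb = arg;
    if (fb->count >= MAX_FRAMES)
        return _URC_END_OF_STACK;          /* stop once the buffer is full */
    fb->pc[fb->count++] = (void *)_Unwind_GetIP(ctx);
    return _URC_NO_REASON;
}

/* Fills fb with the current call chain and returns the frame count. */
static int collect_frames(struct frame_buf *fb)
{
    fb->count = 0;
    _Unwind_Backtrace(collect_cb, fb);
    return fb->count;
}

/* Prints one address as hex using only write(), no stdio and no malloc. */
static void write_hex(int fd, uintptr_t v)
{
    char buf[2 + 2 * sizeof v + 1];
    size_t i = sizeof buf;
    buf[--i] = '\n';
    do { buf[--i] = "0123456789abcdef"[v & 0xf]; v >>= 4; } while (v != 0);
    buf[--i] = 'x';
    buf[--i] = '0';
    if (write(fd, buf + i, sizeof buf - i) < 0) { /* nothing safe to do */ }
}

static void crash_handler(int sig)
{
    struct frame_buf fb;
    int n = collect_frames(&fb);
    for (int i = 0; i < n; ++i)
        write_hex(STDERR_FILENO, (uintptr_t)fb.pc[i]);
    _exit(128 + sig);                      /* do not return into the fault */
}

/* Installs the handler for the three signals discussed in this post. */
static int install_crash_handler(void)
{
    const int sigs[] = { SIGSEGV, SIGABRT, SIGFPE };
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; ++i)
        if (sigaction(sigs[i], &sa, NULL) != 0)
            return -1;
    return 0;
}
```

One way to use it is to call install_crash_handler() early in main() and resolve the printed addresses offline, for example with gdb against the unstripped binary.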
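When I first started, I looked into several approaches, roughly as follows (suggestions and additions are welcome):
__builtin_return_address: the most common approach in our current code. It returns the address of a caller further up the stack. The drawback is that on many ARM platforms (or, judging from what I read online, with gcc's defaults) it can only go back one level, i.e. only __builtin_return_address(0) is supported.
-mapcs/-mapcs-frame: the principle is the same as the ARM stack-frame layout I studied earlier. The option tells the compiler to follow APCS (ARM Procedure Call Standard), which specifies how ARM registers are used and the push/pop conventions for function calls. The drawback is that with complex code it triggered internal compiler errors, so the build failed.
-funwind-tables: the approach I finally adopted. I tried it without much hope and found that the build passed even with complex code. The principle is that the frame unwind information is stored in a dedicated linker section, and this information lets the program "peek" at the context at any point.
Once the code built with -funwind-tables, I used _Unwind_Backtrace and _Unwind_GetIP to do the backtrace. But this only prints addresses, so each time I had to look up the function names with gdb against the unstripped file.
Then I discovered how handy dladdr is, but found that only symbols in shared libraries could be resolved accurately, so I converted some of the executable's related libraries into shared libraries, which was simply the most direct fix. The -rdynamic option I came across later amazed me once more: it exports the executable's own symbols to the dynamic symbol table so that dladdr can resolve them too, and it seems made for -funwind-tables.
With -funwind-tables in place, I compared all of the project's gcc options one by one and found that -O1/-O2/-O3/-Os prevented the backtrace from working (I later found that it worked even with -Os, but readers who run into problems may want to check in this direction).
After deploying the tool and catching the common signals SIGSEGV/SIGABRT/SIGFPE, I found that some crashes still could not be backtraced:
abort()
division by zero
signal 11 cases: a null function pointer (an inherent limitation of unwind) and memset on a null pointer to an mmap ring buffer
Backtracing abort and division by zero does work on a PC, so something had to differ from the PC. Apart from the architecture there seemed to be nothing left to suspect. But unwind works in principle and is architecture independent, so there had to be something left to try: rebuilding glibc with -funwind-tables. Below are the pitfalls I still hit while following the methods found online: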
callon@callon-virtual-machine:~/Documents/glibc-2.16.0$ ./configure --prefix=/home/callon/Documents/out --host=arm-linux --enable-add-on=nptl CC=arm-hisiv400-linux-gcc CXX=arm-hisiv400-linux-g++
Error 1:
configure: error: you must configure in a separate build directory
Fix 1:
mkdir build/ out/;cd build/;../configure --prefix=/home/callon/Documents/glibc-2.16.0/out --host=arm-linux --enable-add-on=nptl CC=arm-hisiv400-linux-gcc CXX=arm-hisiv400-linux-g++
Error 2:
checking sysdep dirs... configure: error: The arm is not supported.
Fix 2: following the method found online, download glibc-ports-2.16.0, extract it into the glibc-2.16.0 directory, and rename it to ports.
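Error 3:
checking add-on ports for preconfigure fragments... alpha am33 arm Old ABI no longer supported
Fix 3: I searched for where the "no longer supported" error is reported and found that it was caused by a variable holding the wrong value: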
callon@callon-virtual-machine:~/Documents/glibc-2.16.0$ grep -nr "no longer supported" .
./ports/sysdeps/arm/preconfigure:45: echo "Old ABI no longer supported" 2>&1
./README:25:Linux kernels is no longer supported, and we are not distributing it
Finally, changing --host to the HiSilicon compiler's original prefix made configure succeed:
callon@callon-virtual-machine:~/Documents/glibc-2.16.0/build$ ../configure --prefix=/home/callon/Documents/glibc-2.16.0/out --host=arm-hisiv400-linux-gnueabi --enable-add-on=nptl CC=arm-hisiv400-linux-gcc CXX=arm-hisiv400-linux-g++
Since the goal is a glibc that can be backtraced properly, the overall configure options, following what I found online, were:
callon@callon-virtual-machine:~/Documents/glibc-2.16.0/build$ ../configure --prefix=/home/callon/Documents/glibc-2.16.0/out --host=arm-hisiv400-linux-gnueabi --enable-add-on=nptl CC=arm-hisiv400-linux-gcc CXX=arm-hisiv400-linux-g++ CFLAGS="-g -O2 -U_FORTIFY_SOURCE" libc_cv_forced_unwind=yes libc_cv_c_cleanup=yes
Then, on the device:
mkdir /mnt/nfs;mount -t nfs -o nolock xxx.xx.xx.xx:/home/callon/nfs/test /mnt/nfs;cd /mnt/nfs
mkdir /libtmp;cp -d ./lib/* /libtmp
export LD_LIBRARY_PATH=/libtmp:$LD_LIBRARY_PATH
./test
abort still could not be backtraced.
Just as I was about to give up, I used:
callon@callon-virtual-machine:~/Documents/glibc-2.16.0/build$ ../configure --prefix=/home/callon/Documents/glibc-2.16.0/out --host=arm-hisiv400-linux-gnueabi --enable-add-on=nptl CC=arm-hisiv400_v2-linux-gcc CXX=arm-hisiv400_v2-linux-g++ CFLAGS="-g -Os -funwind-tables"
which finally backtraced signal 6 successfully. Signal 8 still failed after many attempts. Digging deeper with a disassembly revealed the cause:
void test_func()
{
    841c: e92d4800 push {fp, lr}
    8420: e28db004 add fp, sp, #4
    8424: e24dd008 sub sp, sp, #8
  int a = 0, b = 1;
    8428: e3a03000 mov r3, #0
    842c: e50b3008 str r3, [fp, #-8]
    8430: e3a03001 mov r3, #1
    8434: e50b300c str r3, [fp, #-12]
  b /= a;
    8438: e51b000c ldr r0, [fp, #-12]
    843c: e51b1008 ldr r1, [fp, #-8]
    8440: eb000016 bl 84a0 <__aeabi_idiv>
    8444: e1a03000 mov r3, r0
    8448: e50b300c str r3, [fp, #-12]
  b += a;
    844c: e51b200c ldr r2, [fp, #-12]
    8450: e51b3008 ldr r3, [fp, #-8]
    8454: e0823003 add r3, r2, r3
    8458: e50b300c str r3, [fp, #-12]
}
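From material I found online, it turns out that for an ARM toolchain conforming to EABI (the embedded ARM application binary interface), the compiler replaces the '/' operation with a call to __aeabi_idiv, and __aeabi_idiv is provided by libgcc.so or libgcc.a.
So rebuilding glibc alone is not enough. Ideally the gcc libraries of the whole toolchain should be rebuilt as well. Since divide-by-zero exceptions are rare in practice, I did not pursue this further.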
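However, in repeated demo attempts I found that signal 11 crashes inside memcpy/memset/memmove could not be backtraced, while a simple null-pointer assignment, strncpy, and so on worked fine. Stranger still, the default libc.so.6 could backtrace memcpy but not memset. While using grep to find out how memcpy differs from functions like strncpy, I found that glibc-ports contains memset.S/memcpy.S/memmove.S but no assembly implementation of strncpy. A printf added to string/memcpy.c never printed anything, while one added to strncpy did. So the question became how to make memcpy use the .c implementation instead of the .S one. Along the way I tried:
Removing the -fno-unwind-tables option (../configure then failed to run), or manually deleting every occurrence of -fno-unwind-tables after running it (no effect);
Renaming ENTRY in memcpy.S to asm_memcpy (lots of compile errors).
Fortunately, after persistent suspicion, experiment, and reflection, the final solution was to replace memset.S/memcpy.S/memmove.S with memset.c/memcpy.c/memmove.c from string/ and rebuild. The build passed, and the original demo could now be backtraced in every case! This fits the fact that -funwind-tables only affects compiler-generated code: a hand-written .S file carries unwind information only if it is annotated by hand.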
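The demo itself is not shown here. A hypothetical reproducer for the crash types mentioned above might look like the following sketch, which runs each crashing function in a child process and reports the signal that killed it. crash_memset, crash_abort and crash_signal_of are made-up names for illustration.

```c
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* memset through a null pointer; volatile keeps the compiler from
   reasoning about the pointer value at build time. */
static void crash_memset(void)
{
    void *volatile p = NULL;
    memset(p, 0, 16);
}

static void crash_abort(void)
{
    abort();
}

/* Runs fn in a child process. Returns the terminating signal number,
   0 if the child exited normally, or -1 on fork/waitpid failure. */
static int crash_signal_of(void (*fn)(void))
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        fn();
        _exit(0);               /* reached only if fn did not crash */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Running the same set of crash functions against each rebuilt libc makes it easy to see which cases produce a usable backtrace and which do not.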
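For the final integration into the project, you can run
arm-hisiv400_v2-linux-strip *.so*
to strip the libraries and reduce memory use, and start the process with
LD_PRELOAD=/home/debug_lib/libc.so.6 ./my_program
so that other processes are affected as little as possible. The backtrace wrapper is also based on glibc's own source, with some modifications of my own, mainly in the implementation of backtrace_symbols. I found that after a heap overflow, calling malloc again triggers an assert inside malloc, and the system then deadlocks and cannot recover. That is very serious, so I wrote an implementation that no longer uses malloc: the result buffer is passed in as a local array. As long as the backtrace depth is limited and the result array is sized sensibly, this causes no problems: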
/* Return backtrace of current program state.
   Copyright (C) 2008, 2009 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   Contributed by Kazu Hirata, 2008.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library.  If not, see
   <http://www.gnu.org/licenses/>.  */

#include "unwind_backtrace.h"

struct trace_arg
{
  void **array;
  int cnt, size;
};

#ifdef SHARED
static _Unwind_Reason_Code (*unwind_backtrace) (_Unwind_Trace_Fn, void *);
static _Unwind_VRS_Result (*unwind_vrs_get) (_Unwind_Context *,
                                             _Unwind_VRS_RegClass,
                                             _uw,
                                             _Unwind_VRS_DataRepresentation,
                                             void *);

static void *libgcc_handle;

static void
init (void)
{
  libgcc_handle = __libc_dlopen ("libgcc_s.so.1");

  if (libgcc_handle == NULL)
    return;

  unwind_backtrace = __libc_dlsym (libgcc_handle, "_Unwind_Backtrace");
  unwind_vrs_get = __libc_dlsym (libgcc_handle, "_Unwind_VRS_Get");
  if (unwind_vrs_get == NULL)
    unwind_backtrace = NULL;
}

/* This function is identical to "_Unwind_GetGR", except that it uses
   "unwind_vrs_get" instead of "_Unwind_VRS_Get".  */
static inline _Unwind_Word
unwind_getgr (_Unwind_Context *context, int regno)
{
  _uw val;
  unwind_vrs_get (context, _UVRSC_CORE, regno, _UVRSD_UINT32, &val);
  return val;
}

/* This macro is identical to the _Unwind_GetIP macro, except that it
   uses "unwind_getgr" instead of "_Unwind_GetGR".  */
# define unwind_getip(context) \
  (unwind_getgr (context, 15) & ~(_Unwind_Word)1)
#else
# define unwind_backtrace _Unwind_Backtrace
# define unwind_getip _Unwind_GetIP
#endif

static _Unwind_Reason_Code
backtrace_helper (struct _Unwind_Context *ctx, void *a)
{
  struct trace_arg *arg = a;

  /* We are first called with address in the __backtrace function.
     Skip it.  */
  if (arg->cnt != -1)
    arg->array[arg->cnt] = (void *) unwind_getip (ctx);
  if (++arg->cnt == arg->size)
    return _URC_END_OF_STACK;
  return _URC_NO_REASON;
}

int
backtrace (array, size)
     void **array;
     int size;
{
  struct trace_arg arg = { .array = array, .size = size, .cnt = -1 };
#ifdef SHARED
  __libc_once_define (static, once);

  __libc_once (once, init);
  if (unwind_backtrace == NULL)
    return 0;
#endif

  if (size >= 1)
    unwind_backtrace (backtrace_helper, &arg);

  if (arg.cnt > 1 && arg.array[arg.cnt - 1] == NULL)
    --arg.cnt;
  return arg.cnt != -1 ? arg.cnt : 0;
}

/* Modified from glibc: RESULT is a caller-supplied buffer of MAX_LEN
   bytes (SIZE string pointers followed by the strings themselves), so
   malloc is never called.  */
void
backtrace_symbols (array, size, result, max_len)
     void *const *array;
     int size;
     char **result;
     int max_len;
{
  Dl_info info[size];
  int status[size];
  int cnt;
  size_t total = 0;

  /* Fill in the information we can get from `dladdr'.  */
  for (cnt = 0; cnt < size; ++cnt)
    {
      struct link_map *map;
      status[cnt] = _dl_addr (array[cnt], &info[cnt], &map, NULL);
      if (status[cnt] && info[cnt].dli_fname && info[cnt].dli_fname[0] != '\0')
        {
          /* We have some info, compute the length of the string which will be
             "<file-name>(<sym-name>+offset) [address].  */
          total += (strlen (info[cnt].dli_fname ?: "")
                    + strlen (info[cnt].dli_sname ?: "")
                    + 3 + WORD_WIDTH + 3 + WORD_WIDTH + 5);

          /* The load bias is more useful to the user than the load
             address.  The use of these addresses is to calculate an
             address in the ELF file, so its prelinked bias is not
             something we want to subtract out.  */
          info[cnt].dli_fbase = (void *) map->l_addr;
        }
      else
        total += 5 + WORD_WIDTH;
    }

  if (result != NULL)
    {
      char *last = (char *) (result + size);

      for (cnt = 0; cnt < size; ++cnt)
        {
          result[cnt] = last;

          if (status[cnt]
              && info[cnt].dli_fname != NULL && info[cnt].dli_fname[0] != '\0')
            {
              if (info[cnt].dli_sname == NULL)
                /* We found no symbol name to use, so describe it as
                   relative to the file.  */
                info[cnt].dli_saddr = info[cnt].dli_fbase;

              if (info[cnt].dli_sname == NULL && info[cnt].dli_saddr == 0)
                last += 1 + sprintf (last, "%s(%s) [%p]",
                                     info[cnt].dli_fname ?: "",
                                     info[cnt].dli_sname ?: "",
                                     array[cnt]);
              else
                {
                  char sign;
                  long int offset;
                  if (array[cnt] >= (void *) info[cnt].dli_saddr)
                    {
                      sign = '+';
                      offset = array[cnt] - info[cnt].dli_saddr;
                    }
                  else
                    {
                      sign = '-';
                      offset = info[cnt].dli_saddr - array[cnt];
                    }

                  last += 1 + sprintf (last, "%s(%s%c%#tx) [%p]",
                                       info[cnt].dli_fname ?: "",
                                       info[cnt].dli_sname ?: "",
                                       sign, offset, array[cnt]);
                }
            }
          else
            last += 1 + sprintf (last, "[%p]", array[cnt]);
        }
      assert (last <= (char *) result + max_len);
    }

  return;
}

#ifdef SHARED
/* Free all resources if necessary.  */
libc_freeres_fn (free_mem)
{
  unwind_backtrace = NULL;
  if (libgcc_handle != NULL)
    {
      __libc_dlclose (libgcc_handle);
      libgcc_handle = NULL;
    }
}
#endif
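To make the caller-supplied buffer layout concrete, here is a simplified stand-in, not the code above. It only formats raw addresses and skips the symbol lookup, since _dl_addr is a glibc internal. format_frames_nomalloc is a hypothetical name. Unlike the version above, this sketch checks the remaining space before each write, which is a design choice of the sketch and not something the original does. snprintf is also not formally async-signal-safe, so the same caveat as with sprintf applies.

```c
#include <stddef.h>
#include <stdio.h>

/* Lays out `size` string pointers at the start of `result`, followed by
   the strings themselves, all within max_len bytes and without malloc.
   Returns the number of frames formatted, which may be less than `size`
   if the buffer runs out. */
static int format_frames_nomalloc(void *const *array, int size,
                                  char **result, size_t max_len)
{
    size_t table;
    char *last, *end;
    int cnt;

    if (size <= 0 || result == NULL)
        return 0;
    table = (size_t)size * sizeof(char *);
    if (max_len <= table)
        return 0;                       /* no room for any string data */

    last = (char *)(result + size);     /* strings start after the table */
    end = (char *)result + max_len;

    for (cnt = 0; cnt < size; ++cnt) {
        size_t room = (size_t)(end - last);
        int n = snprintf(last, room, "[%p]", array[cnt]);
        if (n < 0 || (size_t)n >= room)
            break;                      /* entry would not fit: stop here */
        result[cnt] = last;
        last += n + 1;                  /* skip the terminating NUL */
    }
    return cnt;
}
```

A caller would typically declare something like `char *result[64];` on the stack, pass `sizeof result` as max_len, and print result[0] through result[n - 1]. Declaring the buffer as an array of char * keeps the pointer table correctly aligned.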