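Core Dump Analysis in Practice (Java Edition)
Background
- For a period of time the project's battleserver process kept crashing, and the specific cause could not be found
- The Java code calls LuaJIT through JNI
- So when the process crashes, how do we get the JNI (C-level) stack and confirm the underlying problem?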
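Crash analysis with hs_error
- After a Java process crashes, the JVM normally writes a fatal error log, named hs_err_pid<pid>.log by default (called hs_error in these notes)
  - %p in the file name pattern stands for the process pid
  - hs_err stands for HotSpot JVM error
- The location of the file can be set with the JVM option '-XX:ErrorFile', for example
  -XX:ErrorFile=/landon/business/battle/hs_error%p.log
- A typical hs_error file looks like this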
  - Part one is the file header
    - Crash cause: SIGSEGV (0xb)
    - Problematic frame: C, meaning the crash happened in the C layer (native code)
    - The core dump was written. If it had not been, the log would say
      Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    - To have a core dump written, ulimit -c must be enabled (for example ulimit -c unlimited) before the JVM starts
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x00007f838c76f1ae, pid=8257, tid=0x00007f838b5b6700
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_221-b11) (build 1.8.0_221-b11)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.221-b11 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C [libluajit-5.1.so.2+0xc1ae]
    #
    # Core dump written. Default location: /landon/battle_server/online/core or core.8257
  - Part two is the information about the crashing thread
    - The faulting thread here is SOFA-SEV-BOLT-BIZ-12200-3-T11
    - _thread_in_native means the thread was executing native code
    - The middle portion is the register context
      - The register-to-memory mapping shows what each register points at, including addresses inside libjnlua5.1.so and libjvm.so
      - RSP, the stack pointer, points into the current thread's stack
    - The last portion is the stack
      - It lists the Native frames and the Java frames
      - The finding here: only the topmost native frame is shown, as a bare offset into libluajit-5.1.so.2, so the C call chain cannot be seen
    --------------- T H R E A D ---------------
    Current thread (0x00007f83a001e000): JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]
    siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000004
    Registers:
    Top of Stack: (sp=0x00007f838b5b4f70)
    Instructions
    Register to memory mapping:
    RAX=0x0000000000000000 is an unknown value
    RBX=0x0000000040ffb9c0 is an unknown value
    RCX=0x0000000000000000 is an unknown value
    RDX=0x00000000489f80d0 is an unknown value
    RSP=0x00007f838b5b4f70 is pointing into the stack for thread: 0x00007f83a001e000
    RBP=0x00000000407c23b8 is an unknown value
    RSI=0x0000000041cb0e88 is an unknown value
    RDI=0x00000000407c23b8 is an unknown value
    R8 =0x00000007bfc84540 is an oop com.naef.jnlua.LuaState - klass: 'com/naef/jnlua/LuaState'
    R9 =0x00000007bfc84540 is an oop com.naef.jnlua.LuaState - klass: 'com/naef/jnlua/LuaState'
    R10=0x00000000000006ba is an unknown value
    R11=0x00007f843da2cf3c: in /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so at 0x00007f843ca60000
    R12=0x0000000000000000 is an unknown value
    R13=0x00007f838b5b50d0 is pointing into the stack for thread: 0x00007f83a001e000
    R14=0x00000000407c2fa8 is an unknown value
    R15=0x00007f838c9e43e0: in /landon/lib/libjnlua5.1.so at 0x00007f838c9da000
    Stack: [0x00007f838b4b6000,0x00007f838b5b7000], sp=0x00007f838b5b4f70, free space=1019k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    C [libluajit-5.1.so.2+0xc1ae]
    Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
    J 4447 com.naef.jnlua.LuaState.lua_gc(II)I (0 bytes) @ 0x00007f84291dd278 [0x00007f84291dd240+0x38]
    J 5983 C2 com.landon30.jlua.pool.JLua.call(Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (206 bytes) @ 0x00007f842946e01c [0x00007f842946dbe0+0x43c]
    J 6040 C1 com.landon30.jlua.LuaCallManager.callLua(Ljava/util/Set;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (474 bytes) @ 0x00007f842965fd1c [0x00007f842965f4e0+0x83c]
  - Parts three and four mainly describe the process and the system at crash time
    - All Java threads are listed, with the current thread marked
    - Java heap information and similar data are included
    - System information comes last
    --------------- P R O C E S S ---------------
    Java Threads: ( => current thread )
    =>0x00007f83a001e000 JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]
    VM state:not at safepoint (normal execution)
    VM Mutex/Monitor currently owned by a thread: None
    Heap:
    Card table byte_map:
    Marking Bits:
    Polling page:
    CodeCache:
    Compilation events:
    GC Heap History:
    Deoptimization events:
    Classes redefined:
    Internal exceptions:
    Events:
    Dynamic libraries:
    VM Arguments:
    Environment Variables:
    Signal Handlers:
    --------------- S Y S T E M ---------------
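When many hs_error files pile up, the few fields discussed above (signal, problematic frame, crashing thread) are usually what gets read first. The following is a minimal sketch for pulling them out with standard text tools. The function names hs_err_summary and frame_lib_offset are made up for this illustration and are not part of the JDK.

```shell
# Sketch only: hs_err_summary and frame_lib_offset are hypothetical helper names.

# Print the signal line, the problematic frame and the crashing thread of one log.
hs_err_summary() {
  # Signal line, e.g. "# SIGSEGV (0xb) at pc=..., pid=..., tid=..."
  awk '/^#[ \t]+SIG[A-Z]+ / { sub(/^#[ \t]+/, ""); print; exit }' "$1"
  # The line right after "# Problematic frame:" names the faulting frame.
  awk '/^# Problematic frame:/ { getline; sub(/^#[ \t]+/, ""); print; exit }' "$1"
  # Crashing thread name and state, e.g. [_thread_in_native, ...]
  grep -m1 '^Current thread' "$1"
}

# Turn "C [libluajit-5.1.so.2+0xc1ae]" into "libluajit-5.1.so.2 0xc1ae".
frame_lib_offset() {
  printf '%s\n' "$1" | sed -nE 's/.*\[([^]]+)\+(0x[0-9a-fA-F]+)\].*/\1 \2/p'
}

# Usage (the path is an example):
#   hs_err_summary /landon/business/battle/hs_error8257.log
#   frame_lib_offset 'C [libluajit-5.1.so.2+0xc1ae]'
```

If the library was built with debug information, the library and offset pair can be tried with binutils' addr2line (addr2line -f -e <library> <offset>) to get a function name and line. For a stripped library this only prints question marks, which is one more reason the core dump is needed.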
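- hs_error summary
  - A crash at the Java level can usually be located directly from hs_error, because the stack is there
  - A crash at the C level, however, comes without a usable native stack
  - In that case the core dump has to be analysed, and the main prerequisite is that ulimit -c is enabled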
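Since everything that follows depends on a core file actually being written, it can help to check the limit in the same shell that launches Java. Below is a small sketch of such a check. core_limit_allows_dump is a hypothetical helper name, and the core_pattern lookup assumes a Linux host.

```shell
# Sketch of a pre-launch check; core_limit_allows_dump is a hypothetical helper name.
# Argument: the value printed by `ulimit -c` ("unlimited" or a block count).
core_limit_allows_dump() {
  [ "$1" = "unlimited" ] && return 0
  # Any positive number allows a (possibly truncated) core file; 0 disables it.
  [ "$1" -gt 0 ] 2>/dev/null
}

limit=$(ulimit -c)
if core_limit_allows_dump "$limit"; then
  echo "core dumps enabled (ulimit -c = $limit)"
else
  echo "core dumps disabled; run 'ulimit -c unlimited' in the shell that starts Java"
fi

# On Linux the kernel's core_pattern decides the core file name and location.
if [ -r /proc/sys/kernel/core_pattern ]; then
  echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
fi
```

The limit is inherited per process, so the check only says something about a JVM started from this same shell or start script.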
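Crash analysis with the core dump
- As covered above, once ulimit -c is enabled a core dump is generated on crash
  6.9G Nov 2 21:58 core.26972
  - Note: the core file is large, so CPU load and iowait are usually high while the crashing process writes it
- gdb is one tool for analysing core dumps, so it has to be installed
  $ gdb --version
  GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
- Analysing the core dump produced by the Java crash
  - Open it with gdb
  $ gdb /landon/lib/jdk/bin/java core.26972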
- Run bt or bt full to view the backtrace
  - Some real C frames are now visible
  - For example #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
  (gdb) bt
  #0 0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
  #1 0x00007f2169c71cd5 in abort () at abort.c:92
  #2 0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  #3 0x00007f2169725733 in VMError::report_and_die() () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  #4 0x00007f216956aa45 in JVM_handle_linux_signal () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  #5 0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  #6 <signal handler called>
  #7 0x00007f21380cbb8d in ?? ()
  #8 0x0000000000000001 in ?? ()
  #9 0x000000005d5fdfa8 in ?? ()
  #10 0x000000005d696b20 in ?? ()
  #11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
  #12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
  #13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
  #14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
  #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
  #16 0x00007f2155595eaa in ?? ()
  #17 0x0000000669220970 in ?? ()
  #18 0x000000075f554730 in ?? ()
  #19 0x0000000000000000 in ?? ()
- bt full shows more detail, including call arguments and local variables
  - Combine it with the source, e.g. lj_state.c:71, and the argument values to work out the cause
  (gdb) bt full
  #0 0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
          resultvar = 0
          pid = <value optimized out>
          selftid = <value optimized out>
  #1 0x00007f2169c71cd5 in abort () at abort.c:92
          save_stage = 2
          act = {__sigaction_handler = {sa_handler = 0x7f20b70ede80, sa_sigaction = 0x7f20b70ede80}, sa_mask = {__val = {139779442242560, 0, 139781486046377, 0, 139781485559377, 139781480345909, 5872491825372126914, 1925, 335544324, 139781480145399, 10, 139781480027928, 55, 1, 0, 0}}, sa_flags = 0, sa_restorer = 0x260}
          sigs = {__val = {32, 0 <repeats 15 times>}}
  #2 0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  No symbol table info available.
  #3 0x00007f2169725733 in VMError::report_and_die() () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  No symbol table info available.
  #4 0x00007f216956aa45 in JVM_handle_linux_signal () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  No symbol table info available.
  #5 0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
  No symbol table info available.
  #6 <signal handler called>
  No symbol table info available.
  #7 0x00007f21380cbb8d in ?? ()
  No symbol table info available.
  #8 0x0000000000000001 in ?? ()
  No symbol table info available.
  #9 0x000000005d5fdfa8 in ?? ()
  No symbol table info available.
  #10 0x000000005d696b20 in ?? ()
  No symbol table info available.
  #11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
  No symbol table info available.
  #12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
  No symbol table info available.
  ---Type <return> to continue, or q <return> to quit---
  #13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
  No symbol table info available.
  #14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
  No symbol table info available.
  #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
          st = 0x7f2114c63f66
          oldst = 0x7f20b70ef370
          delta = 0
          oldsize = 3071210352
          realsize = 32544
          up = 0x0
  #16 0x00007f2155595eaa in ?? ()
  No symbol table info available.
  #17 0x0000000669220970 in ?? ()
  No symbol table info available.
  #18 0x000000075f554730 in ?? ()
  No symbol table info available.
  #19 0x0000000000000000 in ?? ()
  No symbol table info available.
  Press q to quit
- gdb has many other commands, which are not covered one by one here
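For a multi-gigabyte core on a production box it is often more convenient to capture the backtraces once into a file than to page through them interactively. The sketch below wraps gdb's batch mode for that. dump_core_backtraces is a hypothetical helper name and the default output file name is an assumption of this example. The gdb options -batch and -ex and the command thread apply all bt are standard gdb features.

```shell
# Sketch only: dump_core_backtraces is a hypothetical helper name.
# Arguments: <java executable> <core file> [output file]
dump_core_backtraces() {
  java_bin="$1"; core_file="$2"; out_file="${3:-$2.bt.log}"
  # Fail early on a wrong path before loading a multi-GB core.
  [ -r "$java_bin" ] && [ -r "$core_file" ] || { echo "missing executable or core" >&2; return 1; }
  command -v gdb >/dev/null 2>&1 || { echo "gdb is not installed" >&2; return 2; }
  # -batch exits after the -ex commands have run; "thread apply all bt" covers every thread.
  gdb -batch -ex "bt full" -ex "thread apply all bt" "$java_bin" "$core_file" > "$out_file" 2>&1
}

# Usage (paths taken from the example above):
#   dump_core_backtraces /landon/lib/jdk/bin/java core.26972
```

The executable passed to gdb should be the same java binary that produced the core, otherwise the frames may not resolve sensibly.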
- Analysing the core dump with the JVM's own tools
  - Using jstack
    $ jstack /landon/lib/jdk/bin/java core.30298 >> core.30298.jstack.log
    - This shows the stack of every thread, which hs_error does not: hs_error only has the current thread's stack plus a list of all threads
    Deadlock Detection:
    No deadlocks found.
    Thread 30402: (state = IN_NATIVE)
     - com.naef.jnlua.LuaState.lua_pcall(int, int) @bci=0 (Compiled frame; information may be imprecise)
     - com.naef.jnlua.LuaState.call(int, int) @bci=7, line=708 (Compiled frame)
     - com.landon30.JLua.call(java.lang.String, java.lang.Object[]) @bci=53, line=46 (Compiled frame)
     - com.landon30.jlua.LuaCallManager.callLua(java.util.Set, java.lang.String, java.lang.Object[]) @bci=31, line=104 (Compiled frame)
     - com.landon30.jlua.LuaCallManager.callLua(java.lang.String, java.lang.String, java.lang.String, java.lang.Object[]) @bci=12, line=192 (Compiled frame)
    .......
  - Using jmap
    - $ jmap /landon/lib/jdk/bin/java core.30298
      ...
      0x00007f274a5eb000 55K /landon/lib/libjnlua5.1.so
      0x00007f274a374000 474K /landon/lib/libluajit-5.1.so.2
      ...
    - $ jmap -dump:live,format=b,file=30298.hprof /landon/lib/jdk/bin/java core.30298
      - This converts the core dump into an hprof file, which can then be analysed for memory issues with tools such as MAT
      - It takes a long time to run, so use it with care in production
  - Note: the help text of both jstack and jmap mentions core dump analysis
    $ jstack --help
    jstack [-m] [-l] <executable> <core>
    (to connect to a core file)
    $ jmap --help
    jmap [option] <executable> <core>
    (to connect to a core file)
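The jstack log written from a core file can be long, and for this kind of crash the interesting threads are the ones sitting in native code inside jnlua. The sketch below filters the log down to those threads. It assumes the "Thread <id>: (state = ...)" layout shown in the jstack output above, and native_jnlua_threads is a hypothetical helper name.

```shell
# Sketch only: native_jnlua_threads is a hypothetical helper name.
# Argument: a jstack log produced from a core file.
# Prints the header and top frame of each IN_NATIVE thread whose top frame is in com.naef.jnlua.
native_jnlua_threads() {
  awk '
    # A thread header starts a new record; remember it and whether it is in native code.
    /^Thread [0-9]+: \(state = / { hdr = $0; want = ($0 ~ /IN_NATIVE/); next }
    # Only the first frame after the header is inspected.
    want && /^ *- / {
      if ($0 ~ /com\.naef\.jnlua\./) { print hdr; print $0 }
      want = 0
    }
  ' "$1"
}

# Usage (file name from the jstack example above):
#   native_jnlua_threads core.30298.jstack.log
```

Several threads showing up here at the same moment is a hint worth following when the same Lua state might be shared across threads.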
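- About debug information
  - When reading stacks in gdb you will often see the message 'No symbol table info available.', meaning no symbol table is available
  - Put simply, this is about debug information: without it, the detailed stack cannot be shown
  - The same idea exists in Java, in javac's compile options
    -g:none Generate no debugging info
    -g:{lines,vars,source} Generate only some debugging info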
  - How to tell whether a .so library has debug information
    - Use the nm command: List symbols in [file(s)]
      // a .so without a symbol table
      $ nm -a libjnlua5.1.so_no_debug
      nm: libjnlua5.1.so_no_debug: no symbols
      $ nm -a libjnlua5.1.so
      000000000020f818 b .bss
      0000000000000000 n .comment
      ......
      // show line numbers
      $ nm -l libjnlua5.1.so | head
      0000000000009a50 T JNI_OnLoad /landon/lib/JNLuaJIT-master/src/main/c/Linux/../jnlua.c:1911
    - Or simply load the library in gdb
      $ gdb libjnlua5.1.so
      // with symbols present, they are read successfully
      Reading symbols from /landon/lib/libjnlua5.1.so...done.
      // without them, gdb reports no debugging symbols found
      Reading symbols from /landon/lib/libjnlua5.1.so_no_debug...(no debugging symbols found)...done.
    - It is also important to check whether the file has been stripped; a stripped file never has debug information
      - Use the file command
      // "stripped" means there is definitely no debug information
      $ file libjnlua5.1.so_no_debug
      libjnlua5.1.so_no_debug: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped
      // shows "not stripped"
      $ file libjnlua5.1.so
      libjnlua5.1.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped
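When several native libraries are deployed, the file check above is easy to script. The sketch below classifies the text printed by file. has_symbol_table is a hypothetical helper name, and the loop in the usage comment assumes the libraries live under /landon/lib as in the examples.

```shell
# Sketch only: has_symbol_table is a hypothetical helper name.
# Argument: one line of output from the `file` command.
# Returns 0 for "not stripped", 1 for "stripped", 2 when neither word appears.
has_symbol_table() {
  case "$1" in
    *"not stripped"*) return 0 ;;   # must be tested first, it also contains "stripped"
    *stripped*)       return 1 ;;
    *)                return 2 ;;   # probably not an ELF object
  esac
}

# Usage:
#   for so in /landon/lib/*.so*; do
#     has_symbol_table "$(file "$so")" || echo "no symbols: $so"
#   done
#
# "not stripped" only proves that the symbol table is still there. Whether DWARF
# debug sections exist can be checked separately, for example with
#   readelf -S <library> | grep debug_info
```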
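  - How to generate debug information, shown with three examples
    - LuaJIT-2.1.0-beta2
      $ make CCDEBUG=-g CFLAGS=-O0
      - Note: -O0 must be specified, otherwise bt full in gdb shows 'value optimized out'
    - JNLuaJIT
      - Go to the src/main/c/Linux directory and edit the Makefile
      - LDFLAGS previously used the -s option, and -s is exactly the stripping described above; replace it with -g
      - Also add -g to CFLAGS, and optionally -O0
      - Rebuild with make
    - lua cjson
      - Edit the Makefile
      - Enable the CFLAGS line that carries debug information
        -g -Wall -pedantic -fno-inline
    - After rebuilding, nm and gdb can both be used to verify that symbol information was generated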
- A hint printed by gdb
  Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64
  Install the debuginfo packages exactly as the hint says
  - After installation, debug information for low-level libraries such as libgcc is available as well
    Reading symbols from /lib64/libgcc_s.so.1...Reading symbols from /usr/lib/debug/lib64/libgcc_s-4.4.7-20120601.so.1.debug...done.
    done.
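gdb may print the "Missing separate debuginfos" hint more than once in a session, with overlapping package lists. If the gdb output was saved to a file, the packages can be collected into one de-duplicated list before installing. A minimal sketch follows. debuginfo_packages is a hypothetical helper name and gdb.log is an assumed file name.

```shell
# Sketch only: debuginfo_packages is a hypothetical helper name.
# Argument: a file holding saved gdb output.
# Prints each package named in a "Missing separate debuginfos" hint once, one per line.
debuginfo_packages() {
  sed -n 's/^Missing separate debuginfos, use: debuginfo-install //p' "$1" |
    tr -s ' ' '\n' | awk 'NF' | sort -u
}

# Usage (gdb.log is an assumed file name):
#   debuginfo-install $(debuginfo_packages gdb.log)
```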
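Cause of the battleserver crashes
- The process crashed when the same Lua state was used from multiple threads
- Some Lua code runs slower under LuaJIT's JIT mode, and can also crash
- So when the server replays a battle, one option is to turn the JIT compiler off (for example with LuaJIT's jit.off()) and run on LuaJIT's interpreter only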