Core dump实战分析之Java版

Core dump实战分析(Java版)

背景

  1. 项目中的battleserver进程在某一段时间总是crash,无法找到具体Crash原因
    • Java通过JNI调用Luajit
  2. 那么进程Crash如何找到JNI的堆栈(C层),进而确认底层问题呢?

Crash分析之hs_error

  1. Java进程Crash后通常会生成一个hs_error%pid.log

    • %p指进程的pid
    • hs_error指HotSpot JVM error
  2. 该文件的位置可通过JVM参数'-XX:ErrorFile'指定如

    • -XX:ErrorFile=/landon/business/battle/hs_error%p.log
  3. 一个典型的hs_error文件内容如下

    • 第一部分文件头

      • Crash原因:SIGSEGV (0xb)
      • Problematic frame:C,即堆栈出现在了C层(native code)
      • Core dump已经写入,如果没有写入,会提示
        • Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
        • 如果要开启core dump,一定要打开ulimit -c参数
      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGSEGV (0xb) at pc=0x00007f838c76f1ae, pid=8257, tid=0x00007f838b5b6700
      #
      # JRE version: Java(TM) SE Runtime Environment (8.0_221-b11) (build 1.8.0_221-b11)
      # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.221-b11 mixed mode linux-amd64 compressed oops)
      # Problematic frame:
      # C  [libluajit-5.1.so.2+0xc1ae]
      #
      # Core dump written. Default location: /landon/battle_server/online/core or core.8257
      
    • 第二部分是Crash的线程信息

      • 这个看到出问题的线程是SOFA-SEV-BOLT-BIZ-12200-3-T11
      • _thread_in_native表示在执行native代码
      • 中间一部分是寄存器上下文
        • 从这里可以看到各寄存器的内存映射,从libjnlua5.1.so到ibjvm.so
        • RSP堆栈指针指向了当前线程
      • 最后一部分是堆栈信息
        • 从Java frames到Native frames
        • 从这里发现:无法看到Native frames
      ---------------  T H R E A D  ---------------
      
      Current thread (0x00007f83a001e000):  JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]
      
      siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000004
      
      Registers:
      Top of Stack: (sp=0x00007f838b5b4f70)
      Instructions
      
      Register to memory mapping:
      
      RAX=0x0000000000000000 is an unknown value
      RBX=0x0000000040ffb9c0 is an unknown value
      RCX=0x0000000000000000 is an unknown value
      RDX=0x00000000489f80d0 is an unknown value
      RSP=0x00007f838b5b4f70 is pointing into the stack for thread: 0x00007f83a001e000
      RBP=0x00000000407c23b8 is an unknown value
      RSI=0x0000000041cb0e88 is an unknown value
      RDI=0x00000000407c23b8 is an unknown value
      R8 =0x00000007bfc84540 is an oop
      com.naef.jnlua.LuaState 
       - klass: 'com/naef/jnlua/LuaState'
      R9 =0x00000007bfc84540 is an oop
      com.naef.jnlua.LuaState 
       - klass: 'com/naef/jnlua/LuaState'
      R10=0x00000000000006ba is an unknown value
      R11=0x00007f843da2cf3c:  in /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so at 0x00007f843ca60000
      R12=0x0000000000000000 is an unknown value
      R13=0x00007f838b5b50d0 is pointing into the stack for thread: 0x00007f83a001e000
      R14=0x00000000407c2fa8 is an unknown value
      R15=0x00007f838c9e43e0:  in /landon/lib/libjnlua5.1.so at 0x00007f838c9da000
      
      
      Stack: [0x00007f838b4b6000,0x00007f838b5b7000],  sp=0x00007f838b5b4f70,  free space=1019k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      C  [libluajit-5.1.so.2+0xc1ae]
      
      Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
      J 4447  com.naef.jnlua.LuaState.lua_gc(II)I (0 bytes) @ 0x00007f84291dd278 [0x00007f84291dd240+0x38]
      J 5983 C2 com.landon30.jlua.pool.JLua.call(Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (206 bytes) @ 0x00007f842946e01c [0x00007f842946dbe0+0x43c]
      J 6040 C1 com.landon30.jlua.LuaCallManager.callLua(Ljava/util/Set;Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/Object; (474 bytes) @ 0x00007f842965fd1c [0x00007f842965f4e0+0x83c]
      
    • 第三,四部分主要是Crash时进程和系统信息

      • 包括所有的Java线程以及Current thread
      • 还包括Java的堆信息等
      • 最后是系统信息
      ---------------  P R O C E S S  ---------------
      
      Java Threads: ( => current thread )
      =>0x00007f83a001e000 JavaThread "SOFA-SEV-BOLT-BIZ-12200-3-T11" [_thread_in_native, id=27748, stack(0x00007f838b4b6000,0x00007f838b5b7000)]
      
      VM state:not at safepoint (normal execution)
      VM Mutex/Monitor currently owned by a thread: None
      
      Heap:
      Card table byte_map:
      Marking Bits:
      Polling page:
      CodeCache:
      
      Compilation events:
      GC Heap History:
      Deoptimization events:
      Classes redefined:
      Internal exceptions:
      Events:
      
      Dynamic libraries:
      
      VM Arguments:
      Environment Variables:
      Signal Handlers:
      
      ---------------  S Y S T E M  ---------------
      
  4. hs_error总结

    • 通常在Java层面的crash,直接通过hs_error就可以直接定位,因为有堆栈信息
    • 不过如果是C层面的Crash,则无堆栈
    • 此时则需要分析Core dump了,主要前提是要打开ulimit -c

Crash分析之Core dump

  1. 通过上面的分析,打开ulimit -c后就会生成Core dump

    • 6.9G Nov 2 21:58 core.26972
    • 注:因Coredump文件较大,所以通常在进程Crash的写Coredump时cpu load和iowait都较高
  2. 分析Core dump的一个工具是gdb,所以必须安装gdb

    $ gdb --version
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-92.el6)
    
  3. 分析Java进程crash生成的coredump

    1. 用gdb打开
    • $ gdb /landon/lib/jdk/bin/java core.26972
    1. 输入bt或bt full命令查看backtrace

      • 从这里就可以看到一些实际的C堆栈
      • 如#15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
      (gdb) bt
      #0  0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
      #1  0x00007f2169c71cd5 in abort () at abort.c:92
      #2  0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #3  0x00007f2169725733 in VMError::report_and_die() ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #4  0x00007f216956aa45 in JVM_handle_linux_signal ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #5  0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      #6  
      #7  0x00007f21380cbb8d in ?? ()
      #8  0x0000000000000001 in ?? ()
      #9  0x000000005d5fdfa8 in ?? ()
      #10 0x000000005d696b20 in ?? ()
      #11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
      #12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      #13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      #14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
      #16 0x00007f2155595eaa in ?? ()
      #17 0x0000000669220970 in ?? ()
      #18 0x000000075f554730 in ?? ()
      #19 0x0000000000000000 in ?? ()
      
    2. btfull命令可以看到详细的信息包括调用参数等

      • 此时结合源代码如 lj_state.c:71 + 参数分析原因
      (gdb) bt full
      #0  0x00007f2169c704f5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
              resultvar = 0
              pid = 
              selftid = 
      #1  0x00007f2169c71cd5 in abort () at abort.c:92
              save_stage = 2
              act = {__sigaction_handler = {sa_handler = 0x7f20b70ede80, sa_sigaction = 0x7f20b70ede80}, sa_mask = {__val = {
                    139779442242560, 0, 139781486046377, 0, 139781485559377, 139781480345909, 5872491825372126914, 1925, 335544324, 
                    139781480145399, 10, 139781480027928, 55, 1, 0, 0}}, sa_flags = 0, sa_restorer = 0x260}
              sigs = {__val = {32, 0 }}
      #2  0x00007f2169560799 in os::abort(bool) () from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #3  0x00007f2169725733 in VMError::report_and_die() ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #4  0x00007f216956aa45 in JVM_handle_linux_signal ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #5  0x00007f216955d8e8 in signalHandler(int, siginfo*, void*) ()
         from /landon/lib/jdk1.8.0_221/jre/lib/amd64/server/libjvm.so
      No symbol table info available.
      #6  
      No symbol table info available.
      #7  0x00007f21380cbb8d in ?? ()
      No symbol table info available.
      #8  0x0000000000000001 in ?? ()
      No symbol table info available.
      #9  0x000000005d5fdfa8 in ?? ()
      No symbol table info available.
      #10 0x000000005d696b20 in ?? ()
      No symbol table info available.
      #11 0x00007f21149edd03 in lj_cont_ra () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      #12 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      ---Type  to continue, or q  to quit---
      #13 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      #14 0x00007f21149ee456 in lj_ff_coroutine_resume () from /landon/lib/libluajit-5.1.so.2
      No symbol table info available.
      #15 0x00007f21149fcc3d in resizestack (L=0x2, n=0) at lj_state.c:71
              st = 0x7f2114c63f66
              oldst = 0x7f20b70ef370
              delta = 0
              oldsize = 3071210352
              realsize = 32544
              up = 0x0
      #16 0x00007f2155595eaa in ?? ()
      No symbol table info available.
      #17 0x0000000669220970 in ?? ()
      No symbol table info available.
      #18 0x000000075f554730 in ?? ()
      No symbol table info available.
      #19 0x0000000000000000 in ?? ()
      No symbol table info available.
      
    3. 按q退出

    4. gdb还有许多其他命令,这里就不一一细说

  4. 使用jvm原生命令分析coredump

    1. 使用jstack分析

      • $ jstack /landon/lib/jdk/bin/java core.30298 >> core.30298.jstack.log
      • 从这里可以看到所有的线程堆栈,而这个是hs_error没有的,hs_error只能看到当前线程堆栈和所有的线程
      Deadlock Detection:
      No deadlocks found.
      
      Thread 30402: (state = IN_NATIVE)
       - com.naef.jnlua.LuaState.lua_pcall(int, int) @bci=0 (Compiled frame; information may be imprecise)
       - com.naef.jnlua.LuaState.call(int, int) @bci=7, line=708 (Compiled frame)
       - com.landon30.JLua.call(java.lang.String, java.lang.Object[]) @bci=53, line=46 (Compiled frame)
       - com.landon30.jlua.LuaCallManager.callLua(java.util.Set, java.lang.String, java.lang.Object[]) @bci=31, line=104 (Compiled frame)
       - com.landon30.jlua.LuaCallManager.callLua(java.lang.String, java.lang.String, java.lang.String, java.lang.Object[]) @bci=12, line=192 (Compiled frame)
       
      .......
      
    2. 使用jmap分析

      • $ jmap /landon/lib/jdk/bin/java core.30298

        ...
        0x00007f274a5eb000      55K     /landon/lib/libjnlua5.1.so
        0x00007f274a374000      474K    /lanodn/lib/libluajit-5.1.so.2
        ...
        
      • $ jmap -dump:live,format=b,file=30298.hprof /landon/lib/jdk/bin/java core.30298

        • 即将coredump文件转为hprof,这样就可以用如mat工具等分析内存相关了
        • 此执行时间较长,线上慎用
    3. 注:其实jstack和jmap的help文档都有说明分析coredump

      $ jstack --help
      jstack [-m] [-l]  
              (to connect to a core file)
              
      $ jmap --help
      jmap [option] 
              (to connect to a core file)
      

关于调试信息

  1. 从gdb分析堆栈时,经常看到一些这样的信息 'No symbol table info available.'

    • 没有可用符号表

    • 其实简单来说就是调试信息,没有调试信息就无法看到详细的堆栈

      • 这里延伸到Java,javac编译的时候

        -g:none                    Generate no debugging info
        -g:{lines,vars,source}     Generate only some debugging info
        
  2. 如何看一个so库是否有调试信息呢

    • 使用nm命令:List symbols in [file(s)]

      // 无符号表的一个so
      $ nm -a libjnlua5.1.so_no_debug
      nm: libjnlua5.1.so_no_debug: no symbols
      
      $ nm -a libjnlua5.1.so
      000000000020f818 b .bss
      0000000000000000 n .comment
      ......
      
      // 查看行号
      $ nm -l libjnlua5.1.so | head
      0000000000009a50 T JNI_OnLoad   /laodon/lib/JNLuaJIT-master/src/main/c/Linux/../jnlua.c:1911
      
    • 直接使用gdb命令即可

      $ gdb libjnlua5.1.so
      
      // 如果有符号表,会直接读取成功,否则
      Reading symbols from /landon/lib/libjnlua5.1.so...done.
      
      // 没有则直接提示 no debugging symbols found
      Reading symbols from /landon/lib/libjnlua5.1.so_no_debug...(no debugging symbols found)...done.
      
    • 还有一个很重要的是要查看文件是否被stripped,如果被stripped,则一定没有调试信息

      • 使用file命令查看即可
      // 这个显示stripped,则一定没有调试信息
      $ file libjnlua5.1.so_no_debug 
      libjnlua5.1.so_no_debug: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped
      
      // 显示not stripped
      $ file libjnlua5.1.so
      libjnlua5.1.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped
      
  3. 如何生成调试信息:以三个例子说明

    • LuaJIT-2.1.0-beta2
      • $ make CCDEBUG=-g CFLAGS=-O0
      • 注:要指定-O0,否则gdb的bt full会显示'value optimized out''
    • JNLuaJIT
      • 进入src/main/c/Linux目录,编辑Makefile
      • 之前的LDFLAGS是用-s选项,而-s就是之前的说的stripped,将其替换为-g
      • 另外CFLAGS也增加-g,可以选择的增加-O0
      • 重新make编译
    • lua cjson
      • 编辑Makefile
      • 打开调试信息的CFLAGS -g -Wall -pedantic -fno-inline
    • 重新编译后,都可以用nm命令和gdb验证是否生成符号信息
  4. gdb提示Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 libgcc-4.4.7-23.el6.x86_64 ncurses-libs-5.7-4.20090207.el6.x86_64

    • 按照提示直接安装debuginfo即可

    • 安装后,gcc等底层库的调试信息也都有了

      Reading symbols from /lib64/libgcc_s.so.1...Reading symbols from /usr/lib/debug/lib64/libgcc_s-4.4.7-20120601.so.1.debug...done.
      done.
      

battleserver crash的原因

  1. 同一个luastate多线程使用时crash
  2. 某些Lua代码在luajit模式执行下会变慢,也会crash
    • 所以在服务器执行battle复盘的时候可以尝试关闭luajit模式,只有luajit的解释模式

你可能感兴趣的:(Core dump实战分析之Java版)