触发JVM fatal error并配置相关JVM参数

1. 絮絮叨叨

  • 工作中,Java服务因为fatal error(致命错误,笔者称其为jvm crash),在服务运行日志中出现了致命错误的概要信息:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x000000010a7d52e8, pid=47989, tid=11011
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-amd64)
    # Problematic frame:
    # V  [libjvm.dylib+0xada2e8]  Unsafe_GetByte(JNIEnv_*, _jobject*, _jobject*, long)+0xd8
    #
    # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /Users/xxx/IdeaProjects/study/hs_err_pid47989.log
    #
    # If you would like to submit a bug report, please visit:
    #   https://github.com/adoptium/adoptium-support/issues
    #
    
  • 服务运行在k8s中,由于未提前设置fatal error日志的路径(挂载到宿主机目录),容器重启后该日志会丢失,无法深入排查原因

  • 因此,需要查询jvm的配置,将fatal error日志写入指定目录,保证该日志持久化存储到宿主机磁盘

2. 配置jvm参数,实现日志的持久化存储

2.1 -XX:ErrorFile配置fatal error路径

  • 通过查阅资料,了解到可以通过-XX:ErrorFile=filename配置hs_err日志的路径

  • 下面的示例中,将fatal error的日志写入指定目录,文件名的%p会动态替换成改Java程序的PID(进程id)

    -XX:ErrorFile=/var/log/java/java_error%p.log
    
  • 默认将fatal error日志写入Java程序的working directory,且文件名为hs_err_pid.log;如果空间不足、权限不够等原因,fatal error日志将被写入系统的临时目录

  • 详情见JDK官网的说明:

    • JDK 8:A Fatal Error Log
    • JDK 17:A Fatal Error Log,Command-Line Options

2.2 笔者的错误配置

  • 考虑到服务每次重启的pid基本一致,如果多次出现fatal error,只使用pid的日志会被覆盖。

  • 笔者结合之前配置heap dump的经验,添加了%t以生成类似2023-08-16_23-33-08的时间戳

    -XX:ErrorFile=/data_path/var/log/hs_err_pid%p_%t.log
    
  • 当再一次发生fatal error时,发现日志文件名为hs_err_pid6_%t.log,即%t未按照预期进行解析

2.3 -XX:OnError配置更新文件名

  • 受问题(How to specify a unique name for the JVM crash log files?)启发,配置-XX:OnError:在日志生成后,执行shell命令为其添加时间戳

    -XX:ErrorFile=/data_path/var/log/hs_err.log
    -XX:OnError="time=`date +%Y%m%d_%H%M%S` && mv /data_path/var/log/hs_err.log /data_path/var/log/hs_err_\${time}.log"
    

3. 如何触发fatal error?

  • 不管是验证相关JVM参数的配置,还是学习查看fatal error日志的内容,学会如何在触发fatal error是非常必要的

  • 参考:Write Java code to crash the java virtual machine,通过如下代码可以成功在本地触发fatal error

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;
    
    public class CrashTest {
        public static void main(String... args) throws Exception {
            getUnsafe().getByte(0);
        }
    
        private static Unsafe getUnsafe() throws NoSuchFieldException, IllegalAccessException {
            Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            return (Unsafe) theUnsafe.get(null);
        }
    }
    

4. 待交流的问题

4.1 本地验证OK

  • 按照上面的描述,笔者为CrashTest配置了如下JVM参数

    -XX:ErrorFile=/data_path/study/hs_err.log
    -XX:OnError="time=`date +%Y%m%d_%H%M%S` && echo $time && mv /data_path/hs_err.log /data_path//hs_err_${time}.log"
    
  • 程序运行起来后,打印如下信息:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x000000010a49e2e8, pid=56245, tid=11011
    #
    # JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
    # Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-amd64)
    # Problematic frame:
    # V  [libjvm.dylib+0xada2e8]  Unsafe_GetByte(JNIEnv_*, _jobject*, _jobject*, long)+0xd8
    #
    # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /data_path/hs_err.log
    #
    # If you would like to submit a bug report, please visit:
    #   https://github.com/adoptium/adoptium-support/issues
    #
    #
    # -XX:OnError="time=`date +%Y%m%d_%H%M%S` && mv /data_path/hs_err.log /data_path/hs_err_${time}.log"
    #   Executing /bin/sh -c "time=`date +%Y%m%d_%H%M%S` && mv /data_path/hs_err.log /data_path/hs_err_${time}.log" ...
    
  • 最终,fatal error日志的文件名为hs_err_20230827_202458.log符合预期

4.2 测试环境验证失败

  • 将此配置移动到线上服务,却发现fatal error日志的文件名为hs_err_.log不符合预期

  • 怀疑:未能正确解析${time}

  • 一个问答: How to add the timestamp of the fatal error occurrence to Java fatal error log filename,遇到了与笔者类似的问题

    -XX:ErrorFile={{ .Values.server.data_dir }}/var/log/hs_err.log
    -XX:OnError="mv {{ .Values.server.data_dir }}/var/log/hs_err.log {{ .Values.server.data_dir }}/var/log/hs_err_\$(date +%Y%m%d_%H%M%S).log"
    
  • 虽然更新了配置,但是由于引发fatal error的错误已被修复,无法验证该配置的效果

  • 要么等到后面出现fatal error时验证效果,要么回退镜像版本触发fatal error

  • 若后续有机会验证该配置,笔者会更新结果,暂时在此记录可能的可行解决方案

你可能感兴趣的:(jvm)