记一次很奇怪的 Java 进程Hang住的故障(未解)

一次同事反馈, 说线上访问某个存储的 Spark Executor 进程卡主不动了, 本该几分钟计算完成的任务, 跑了一个多小时都没有完成. 上去看 stacktrace 如下:

"Executor task launch worker-1" daemon prio=10 tid=0x00007fee81cee000 nid=0x2bfa in Object.wait() [0x00007fee63113000]
   java.lang.Thread.State: RUNNABLE
    at com.amazonaws.auth.internal.AWS4SignerRequestParams.(AWS4SignerRequestParams.java:85)
    at com.amazonaws.auth.AWS4Signer.sign(AWS4Signer.java:184)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:800)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:576)
    at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:362)
    at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:328)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:307)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3644)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3597)
    at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:680)
    at com.liulishuo.protoss.s3.S3ObjectSummaryReader.getNextObjectSummary(S3ObjectSummaryReader.java:62)

反编译线上对应的jar 包, AWS4SignerRequestParams.java:85行代码如下:

85:    this.formattedSigningDate = AWS4SignerUtils.formatDateStamp(this.signingDateTimeMilli);

那问题就来了: 这一行应该是调用 AWS4SignerUtils.formatDateStamp 方法进入到下一层 stack, 怎么会卡在这里? 而且线程的状态是 java.lang.Thread.State: RUNNABLEObject.wait() [0x00007fee63113000] . 百思不得其解.

交代一下线上环境:

  • OS: Amazon Linux
  • JDK: java-1.7.0-openjdk-1.7.0.99

接着使用 debuginfo-install java-1.7.0-openjdk-devel-1.7.0.99-2.6.5.0.66.amzn1.x86_64 安装 debuginfo 并使用 pstack 查看 native stacktrace:

Thread 21 (Thread 0x7fee63115700 (LWP 11258)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fee8843f59b in os::PlatformEvent::park (this=0x7fee70457b00) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:5820
#2  0x00007fee8842cbc5 in ObjectMonitor::wait (this=0x24d5528, millis=millis@entry=0, interruptible=interruptible@entry=false, __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/objectMonitor.cpp:1552
#3  0x00007fee88539102 in ObjectSynchronizer::waitUninterruptibly (obj=..., millis=millis@entry=0, __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/synchronizer.cpp:404
#4  0x00007fee8821a8b2 in waitUninterruptibly (__the_thread__=0x7fee81cee000, this=0x7fee631137e0) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/runtime/synchronizer.hpp:159
#5  instanceKlass::initialize_impl (this_oop=..., __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/instanceKlass.cpp:462
#6  0x00007fee8821ae01 in instanceKlass::initialize (this=0x7f26b7170, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/instanceKlass.cpp:246
#7  0x00007fee8836485b in LinkResolver::resolve_static_call (result=..., resolved_klass=..., method_name=0x7fee8224b7b0, method_signature=0x7fee6db74880, current_klass=..., check_access=, initialize_class=true, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:722
#8  0x00007fee88369fbe in resolve_invokestatic (__the_thread__=0x7fee81cee000, index=6, pool=..., result=...) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:1171
#9  LinkResolver::resolve_invoke (result=..., recv=..., recv@entry=..., pool=..., pool@entry=..., index=6, byte=byte@entry=Bytecodes::_invokestatic, __the_thread__=__the_thread__@entry=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:1143
#10 0x00007fee882512ba in InterpreterRuntime::resolve_invoke (thread=0x7fee81cee000, bytecode=Bytecodes::_invokestatic) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/interpreterRuntime.cpp:712
#11 0x00007fee7d01d91a in ?? ()
#12 0x00007fee63113bb8 in ?? ()
#13 0x00007fee7d01d8e3 in ?? ()
#14 0x00000154ecf04b93 in ?? ()
#15 0x000000070fb877f8 in ?? ()
#16 0x00000007261dba70 in ?? ()
#17 0x00007fee63113b78 in ?? ()
#18 0x00000007f26b57c7 in ?? ()
#19 0x00007fee63113bf0 in ?? ()
#20 0x00000007f26b6420 in ?? ()
#21 0x0000000000000000 in ?? ()

简单翻了一下 JDK 源码(之前没看过), 猜测如下:

  • JVM 在 resolve_static_call, 说明上述的 Java 代码的地方是正确的
  • instanceKlass::initialize 的时候被锁了, 难道是 ClassLoader 的死锁?
  • 可以知道的是, 包含上述 Java Class 的 jar 包在 Java 进程中有三个.
#6  0x00007fee8821ae01 in instanceKlass::initialize (this=0x7f26b7170, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/oops/instanceKlass.cpp:246
#7  0x00007fee8836485b in LinkResolver::resolve_static_call (result=..., resolved_klass=..., method_name=0x7fee8224b7b0, method_signature=0x7fee6db74880, current_klass=..., check_access=, initialize_class=true, __the_thread__=0x7fee81cee000) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:722
#8  0x00007fee88369fbe in resolve_invokestatic (__the_thread__=0x7fee81cee000, index=6, pool=..., result=...) at /usr/src/debug/java-1.7.0-openjdk/openjdk/hotspot/src/share/vm/interpreter/linkResolver.cpp:1171

补上所有的stacktrace:

  • pstack
  • jstack

暂且写到这里. 原因未明.
-- EOF --

你可能感兴趣的:(记一次很奇怪的 Java 进程Hang住的故障(未解))