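Preface
It has been a long time since I last wrote anything about the JVM. Work was tiring today and I don't feel like digging through source code, so here is a more practical post instead.
In day-to-day work, when we hit a JVM problem we usually reach for ready-made tools such as VisualVM, JProfiler, Eclipse MAT, or Arthas to locate and fix it. Still, it is worth knowing the command-line tools that ship with the JDK. This post is a short summary of the commonly used jstack command, with some related background on Java threads and concurrency along the way. When I have time, I will also write up jmap, jhat, jstat, and the others.
Introduction to jstack
The jstack command produces a thread dump of a JVM: a snapshot that contains each thread's call stack together with its state, lock information, and so on. Its usage is shown below.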
~ jstack -h
Usage:
    jstack [-l] <pid>
        (to connect to running process)
    jstack -F [-m] [-l] <pid>
        (to connect to a hung process)
    jstack [-m] [-l] <executable> <core>
        (to connect to a core file)
    jstack [-m] [-l] [server_id@]<remote server IP or hostname>
        (to connect to a remote debug server)

Options:
    -F  to force a thread dump. Use when jstack <pid> does not respond (process is hung)
    -m  to print both java and native frames (mixed mode)
    -l  long listing. Prints additional information about locks
    -h or -help to print this help message
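The three options mean the following:
- -F: if a plain jstack call gets no response (for example because the process is hung), add this option to force a thread dump.
- -m: in addition to the Java call stack, also print native stack frames.
- -l: print additional information about locks. This option lengthens the time the JVM is paused, so use it with care in production.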
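jstack attaches to a JVM from the outside. For comparison, here is a minimal sketch of collecting similar information from inside a running program through the standard java.lang.management API. The class name InProcessThreadDump and the output layout are my own choices for illustration, and the result is not formatted like real jstack output.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class InProcessThreadDump {
    // Builds a simple textual dump of all live Java threads in this JVM.
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // true, true: also collect locked monitors and ownable synchronizers,
        // roughly the extra lock information that jstack -l asks for.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" id=").append(info.getThreadId())
              .append(" state=").append(info.getThreadState())
              .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Note that the id printed here is the Java-level thread ID (Thread.getId()), which is not the same thing as the tid or nid fields in a jstack dump.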
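jstack is a powerful tool for diagnosing JVM problems at the thread level, but only if you can read a thread dump, so let's walk through an example.
Thread dump
Below is part of the thread dump produced by running jstack against a healthy Spark Streaming job.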
~ jstack 18747
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.172-b11 mixed mode):
"SparkUI-1510250" #1510250 daemon prio=5 os_prio=0 tid=0x00007f9a6c00c800 nid=0x45cc waiting on condition [0x00007f9ce86e7000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c0420db8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.spark_project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
at org.spark_project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
at org.spark_project.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
at java.lang.Thread.run(Thread.java:748)
"shuffle-server-6-7" #190 daemon prio=5 os_prio=0 tid=0x00007f9b44009000 nid=0x4d80 runnable [0x00007f9ce99f8000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00000000c58e6498> (a io.netty.channel.nio.SelectedSelectionKeySet)
- locked <0x00000000c59c1528> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000c59c1450> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
"SparkUI-84-acceptor-3@3b331d23-ServerConnector@9826a7d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}" #84 daemon prio=3 os_prio=0 tid=0x00007f9d7decc800 nid=0x4500 waiting for monitor entry [0x00007f9d112c8000]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:234)
- waiting to lock <0x00000000c045e868> (a java.lang.Object)
at org.spark_project.jetty.server.ServerConnector.accept(ServerConnector.java:373)
at org.spark_project.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:593)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #25 daemon prio=5 os_prio=0 tid=0x00007f9d7d8da000 nid=0x44c2 in Object.wait() [0x00007f9d19dc0000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
- locked <0x00000000c0031c98> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3213)
at java.lang.Thread.run(Thread.java:748)
"RecurringTimer - JobGenerator" #120 daemon prio=5 os_prio=0 tid=0x00007f9b04045000 nid=0x4cc3 waiting on condition [0x00007f9d10cd4000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.spark.util.SystemClock.waitTillTime(Clock.scala:63)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:93)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
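The first line of each thread's entry contains the thread name, whether it is a daemon thread, its priority, and its IDs: tid is the JVM-internal address of the thread, and nid is the native (operating system) thread ID in hexadecimal. The second line is the thread state, and the lines after that are the call stack. The figure below shows the familiar Java thread state transitions.
The states in a jstack dump are the same as those in the figure, except that NEW never appears. We will go through them one by one. Before that, here is a simplified diagram of ObjectMonitor, the object behind Java monitors. You can also read my earlier article on monitors for more background.
RUNNABLE
The thread is running, or is ready to run as far as the JVM is concerned. If its call stack contains a locked <address> line, the method in that frame holds the lock, i.e. the thread is in the monitor's Owner area.
BLOCKED
The thread is blocked, waiting for another thread to release a lock. The top frame of its call stack shows waiting to lock <address>, meaning that the method is trying to acquire the lock and the thread is waiting in the Entry Set.
WAITING
The thread is waiting indefinitely. There are two cases:
- on object monitor: the thread acquired the lock and then called Object.wait() or Thread.join() without a timeout, which releases the monitor and moves the thread into the monitor's Wait Set. Its call stack shows locked <address>, as in the sample above; many dumps also print a waiting on <address> line under the Object.wait() frame.
- parking: the thread called LockSupport.park() and was suspended directly (park is a low-level primitive provided by Unsafe). The top frame of its call stack shows parking to wait for <address>.
TIMED_WAITING
The thread is waiting with a time limit. There are three cases. Two mirror WAITING: on object monitor (the thread acquired the lock and called Object.wait() or Thread.join() with a timeout) and parking (the thread called LockSupport.parkNanos() or LockSupport.parkUntil()). The third is sleeping, where the thread was put to sleep by Thread.sleep().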
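To see these states outside of a dump, here is a small sketch that drives four threads into WAITING, TIMED_WAITING, and BLOCKED and reads them back with Thread.getState(). The class name ThreadStateDemo, the thread names, and the polling helper are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.LockSupport;

public class ThreadStateDemo {
    // Polls until the thread reports the expected state, giving up after about 5 seconds.
    static Thread.State await(Thread t, Thread.State expected) throws InterruptedException {
        for (int i = 0; i < 5000 && t.getState() != expected; i++) {
            Thread.sleep(1);
        }
        return t.getState();
    }

    static Map<String, Thread.State> observe() throws InterruptedException {
        Object monitor = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (monitor) {
                try { monitor.wait(); } catch (InterruptedException e) { /* exit */ }
            }
        }, "waiter");
        Thread parker = new Thread(() -> {
            // park() may return spuriously, so keep parking until interrupted.
            while (!Thread.currentThread().isInterrupted()) {
                LockSupport.park();
            }
        }, "parker");
        Thread sleeper = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException e) { /* exit */ }
        }, "sleeper");
        Thread contender = new Thread(() -> { synchronized (monitor) { } }, "contender");

        Map<String, Thread.State> seen = new LinkedHashMap<>();
        seen.put("unstarted", waiter.getState());
        waiter.start();
        parker.start();
        sleeper.start();
        seen.put("waiter", await(waiter, Thread.State.WAITING));
        seen.put("parker", await(parker, Thread.State.WAITING));
        seen.put("sleeper", await(sleeper, Thread.State.TIMED_WAITING));
        synchronized (monitor) {
            // We hold the monitor, so the contender has to queue in the Entry Set.
            contender.start();
            seen.put("contender", await(contender, Thread.State.BLOCKED));
        }
        // Clean up: wake the three waiting threads and wait for all of them to finish.
        waiter.interrupt();
        parker.interrupt();
        sleeper.interrupt();
        for (Thread t : new Thread[] {waiter, parker, sleeper, contender}) {
            t.join();
        }
        seen.put("finished", waiter.getState());
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        observe().forEach((name, state) -> System.out.println(name + " -> " + state));
    }
}
```

If you run jstack against the process while the threads are still alive, I would expect waiter, parker, sleeper, and contender to appear as WAITING (on object monitor), WAITING (parking), TIMED_WAITING (sleeping), and BLOCKED (on object monitor) respectively. NEW and TERMINATED are only visible from inside the program.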
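Reading the states and call stacks in a thread dump lets us quickly pin down what is making a Java program misbehave: deadlocks, hot locks (many threads contending for the same critical section, which shows up as a large number of BLOCKED threads), high CPU usage, long-blocking I/O (note that such a thread may still be reported as RUNNABLE), and so on. Two concrete examples follow.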
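For a large dump, a first pass is often just counting. The sketch below tallies threads per state and counts how many threads are queued on each lock address, which is one rough way to spot a hot lock. The class name DumpSummary and the two regular expressions are my own, written against the dump format shown in this post; other JDK versions may format these lines differently.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpSummary {
    // Matches e.g. "java.lang.Thread.State: TIMED_WAITING (parking)" and captures the state name.
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");
    // Matches e.g. "- waiting to lock <0x00000000c045e868>" and captures the address.
    private static final Pattern CONTENDED =
            Pattern.compile("- waiting to lock <(0x[0-9a-f]+)>");

    // Counts how often each captured value occurs in the dump text.
    private static Map<String, Integer> count(String dump, Pattern pattern) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = pattern.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> countStates(String dump) {
        return count(dump, STATE);
    }

    static Map<String, Integer> countContendedLocks(String dump) {
        return count(dump, CONTENDED);
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to a file containing jstack output
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println("threads per state: " + countStates(dump));
        System.out.println("waiters per lock:  " + countContendedLocks(dump));
    }
}
```

One caveat: when jstack reports a deadlock, it repeats the stack frames of the threads involved in its summary section, so the per-lock counts for those addresses are inflated.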
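Diagnosing a deadlock with jstack
Deadlock is a basic concept from operating system theory: in a concurrent environment, two or more threads are each waiting for a resource that is held by another thread in the group, so none of them can make progress. The four necessary conditions for a deadlock are:
- Mutual exclusion
- No preemption
- Hold and wait
- Circular wait
Let's build a deadlock in Java and diagnose it with jstack.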
public class DeadlockExample {
    private static final Object lock1 = new Object();
    private static final Object lock2 = new Object();

    public static void main(String[] args) throws Exception {
        // thread1 takes lock1 first, then lock2
        new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                synchronized (lock1) {
                    System.out.println("thread1 synchronized lock1");
                    synchronized (lock2) {
                        System.out.println("thread1 synchronized lock2");
                    }
                }
            }
        }, "thread1").start();
        // thread2 takes the same locks in the opposite order
        new Thread(() -> {
            for (int i = 0; i < 100; i++) {
                synchronized (lock2) {
                    System.out.println("thread2 synchronized lock2");
                    synchronized (lock1) {
                        System.out.println("thread2 synchronized lock1");
                    }
                }
            }
        }, "thread2").start();
    }
}
Run it, and the output stops after only a few lines.
thread1 synchronized lock1
thread1 synchronized lock2
thread1 synchronized lock1
thread1 synchronized lock2
thread1 synchronized lock1
thread2 synchronized lock2
Print a thread dump with jstack. An excerpt follows.
"thread2" #20 prio=5 os_prio=31 tid=0x00007fad74020800 nid=0x6203 waiting for monitor entry [0x0000700006364000]
java.lang.Thread.State: BLOCKED (on object monitor)
at me.lmagics.DeadlockExample.lambda$main$1(DeadlockExample.java:28)
- waiting to lock <0x00000007157d2a58> (a java.lang.Object)
- locked <0x00000007157d2a68> (a java.lang.Object)
at me.lmagics.DeadlockExample$$Lambda$2/501263526.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
"thread1" #19 prio=5 os_prio=31 tid=0x00007fad7401c000 nid=0x9903 waiting for monitor entry [0x0000700006261000]
java.lang.Thread.State: BLOCKED (on object monitor)
at me.lmagics.DeadlockExample.lambda$main$0(DeadlockExample.java:17)
- waiting to lock <0x00000007157d2a68> (a java.lang.Object)
- locked <0x00000007157d2a58> (a java.lang.Object)
at me.lmagics.DeadlockExample$$Lambda$1/1394438858.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
Found one Java-level deadlock:
=============================
"thread2":
waiting to lock monitor 0x00007fad65004168 (object 0x00000007157d2a58, a java.lang.Object),
which is held by "thread1"
"thread1":
waiting to lock monitor 0x00007fad650056b8 (object 0x00000007157d2a68, a java.lang.Object),
which is held by "thread2"
Java stack information for the threads listed above:
===================================================
"thread2":
at me.lmagics.DeadlockExample.lambda$main$1(DeadlockExample.java:28)
- waiting to lock <0x00000007157d2a58> (a java.lang.Object)
- locked <0x00000007157d2a68> (a java.lang.Object)
at me.lmagics.DeadlockExample$$Lambda$2/501263526.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
"thread1":
at me.lmagics.DeadlockExample.lambda$main$0(DeadlockExample.java:17)
- waiting to lock <0x00000007157d2a68> (a java.lang.Object)
- locked <0x00000007157d2a58> (a java.lang.Object)
at me.lmagics.DeadlockExample$$Lambda$1/1394438858.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
Found 1 deadlock.
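As you can see, not only are both threads in the BLOCKED state, but jstack also reports the deadlock in detail, which makes it straightforward to change the code and eliminate it.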
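One common way to remove this kind of deadlock is to break the circular-wait condition by making every thread take the locks in the same order. The sketch below is one possible fix, not the only one. It also uses the standard ThreadMXBean.findDeadlockedThreads() call to confirm from inside the program that nothing is deadlocked afterwards. The class name OrderedLockExample and the 10-second join timeout are arbitrary choices for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class OrderedLockExample {
    private static final Object lock1 = new Object();
    private static final Object lock2 = new Object();

    // Every caller takes lock1 before lock2, so a cycle of waiting threads cannot form.
    static void work() {
        for (int i = 0; i < 100; i++) {
            synchronized (lock1) {
                synchronized (lock2) {
                    // critical section that needs both resources
                }
            }
        }
    }

    // Runs two workers and reports whether both finished without a deadlock.
    static boolean runAndCheck() throws InterruptedException {
        Thread t1 = new Thread(OrderedLockExample::work, "thread1");
        Thread t2 = new Thread(OrderedLockExample::work, "thread2");
        t1.start();
        t2.start();
        t1.join(10_000);
        t2.join(10_000);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked, otherwise the IDs of the threads involved.
        long[] deadlocked = mx.findDeadlockedThreads();
        return deadlocked == null && !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAndCheck() ? "finished without deadlock" : "threads are stuck");
    }
}
```

Running the same findDeadlockedThreads() check against the original DeadlockExample should return the IDs of thread1 and thread2 once they are stuck, which is the in-process counterpart of the "Found one Java-level deadlock" section above.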
Diagnosing high CPU usage with jstack
Next, let's write a program with an infinite loop to simulate abnormal CPU usage.
public class InfiniteLoopExample {
    private static final Object lock = new Object();

    static class InfiniteLoopRunnable implements Runnable {
        @Override
        public void run() {
            // Whichever thread gets the lock first spins forever; the other stays BLOCKED.
            synchronized (lock) {
                long l = 0;
                while (true) {
                    l++;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new Thread(new InfiniteLoopRunnable(), "thread1").start();
        new Thread(new InfiniteLoopRunnable(), "thread2").start();
    }
}
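Run the program, find the PID of the Java process with jps, and then use top -Hp <pid> to find the thread that consumes the most CPU.
Use jstack to dump the thread snapshot to a file. top reports thread IDs in decimal, whereas the nid field in the dump is hexadecimal, so convert the thread ID from top to hexadecimal before grepping for it.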
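The conversion itself is a one-liner. As a small sketch (the class and method names are made up here), this turns a decimal thread ID from top into the nid=0x... form used in the dump, and back again:

```java
public class NidConverter {
    // Decimal thread ID (as shown by top -H) -> hexadecimal form used by jstack's nid field.
    static String toNid(long decimalTid) {
        return "0x" + Long.toHexString(decimalTid);
    }

    // Hexadecimal nid from a dump (with or without the 0x prefix) -> decimal thread ID.
    static long fromNid(String nid) {
        String digits = nid.startsWith("0x") ? nid.substring(2) : nid;
        return Long.parseLong(digits, 16);
    }

    public static void main(String[] args) {
        // 17868 is the decimal value of nid=0x45cc from the Spark dump earlier in this post.
        System.out.println(toNid(17868));      // prints 0x45cc
        System.out.println(fromNid("0x45cc")); // prints 17868
    }
}
```

Searching the dump file for the resulting string, for example nid=0x45cc, leads straight to the entry of the thread that top flagged.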
With that, we can pinpoint the offending code and fix it.
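When shelling into the machine is inconvenient, a similar question can be asked from inside the JVM. The following is a rough sketch, assuming the JVM supports per-thread CPU time measurement (HotSpot normally does): it samples ThreadMXBean.getThreadCpuTime() twice and reports the thread whose CPU time grew the most. The class name HotThreadFinder, the sampling window, and the "spinner" thread are all illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class HotThreadFinder {
    // Returns the name of the thread that consumed the most CPU time during the window,
    // or null if CPU time measurement is unavailable or no thread could be measured.
    static String hottestThread(long windowMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return null;
        }
        // First sample: CPU time in nanoseconds per Java thread ID.
        Map<Long, Long> before = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            before.put(id, mx.getThreadCpuTime(id));
        }
        Thread.sleep(windowMillis);

        // Second sample: keep the thread with the largest increase.
        long bestId = -1;
        long bestDelta = -1;
        for (long id : mx.getAllThreadIds()) {
            long now = mx.getThreadCpuTime(id); // -1 if the thread is no longer alive
            Long then = before.get(id);
            if (now < 0 || then == null || then < 0) {
                continue; // thread started or died during the window
            }
            if (now - then > bestDelta) {
                bestDelta = now - then;
                bestId = id;
            }
        }
        ThreadInfo info = bestId < 0 ? null : mx.getThreadInfo(bestId);
        return info == null ? null : info.getThreadName();
    }

    public static void main(String[] args) throws InterruptedException {
        // A daemon thread that burns CPU, standing in for the runaway loop above.
        Thread spinner = new Thread(() -> {
            long l = 0;
            while (true) {
                l++;
            }
        }, "spinner");
        spinner.setDaemon(true);
        spinner.start();
        System.out.println("hottest thread: " + hottestThread(500));
    }
}
```

The IDs used here are Java thread IDs (Thread.getId()), not the native thread IDs that top and the nid field show, so the thread name is the practical way to match the result against a jstack dump.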
The End
Good night.