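I work as a Java web backend developer and have recently run into several production incidents in which a service became unavailable: API requests either got no response or took minutes to return. The usual first step is to check the logs, but often there are none (the service cannot respond to client requests at that point). Sometimes the relevant entries are hours old, and some endpoints never logged anything at all, which makes the search difficult. jstack, the JVM's thread stack tool, helps a great deal in these situations.
This article builds a web demo application on SpringBoot, uses endpoint code to illustrate several cases that can leave a service unresponsive, and explains how to troubleshoot them with jstack and related tools.
Environment: single-core CPU virtual machine, CentOS6 + JAVA8 + SpringBoot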
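A jstack dump of a Java process contains the following states that deserve attention:
1. Deadlock: threads are deadlocked;
2. Runnable: the thread is executing. It may still be blocked on third-party IO or stuck in a loop, so it still needs to be checked;
3. Waiting on condition: the thread is waiting for a condition, possibly a response from a network resource. Analyze it together with the stack trace;
4. Waiting on monitor entry: the thread is waiting to acquire a monitor, usually a mutex used for thread synchronization;
5. Conditional wait / timed wait, Object.wait() or TIMED_WAITING: Object.wait() blocks the current thread and releases the Object lock it holds. The thread resumes only after another thread holding that lock calls Object.notify() to wake it;
6. Parked/Parking: the thread is suspended or in the process of being suspended.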
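To make the list above concrete, here is a minimal sketch that puts three daemon threads into the BLOCKED, WAITING and TIMED_WAITING states and reads them back with Thread.getState(). The class name, thread names and the 5-second polling limit are assumptions made for this illustration; the comments note the wording jstack typically prints for each state.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.LockSupport;

class ThreadStateDemo {

    static Thread startDaemon(String name, Runnable body) {
        Thread t = new Thread(body, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive after main returns
        t.start();
        return t;
    }

    /** Polls until the thread reaches the expected state or about 5 seconds have passed. */
    static Thread.State awaitState(Thread t, Thread.State expected) throws InterruptedException {
        long deadline = System.currentTimeMillis() + 5000;
        while (t.getState() != expected && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
        return t.getState();
    }

    static Map<String, Thread.State> observe() throws InterruptedException {
        Map<String, Thread.State> seen = new LinkedHashMap<>();
        final Object monitor = new Object();

        // BLOCKED: jstack shows "waiting for monitor entry"
        synchronized (monitor) {
            Thread blocked = startDaemon("demo-blocked", () -> {
                synchronized (monitor) {
                    // entered only after the caller releases the monitor
                }
            });
            seen.put(blocked.getName(), awaitState(blocked, Thread.State.BLOCKED));
        }

        // WAITING: jstack shows "in Object.wait()"
        Thread waiting = startDaemon("demo-waiting", () -> {
            synchronized (monitor) {
                try {
                    while (true) {
                        monitor.wait(); // releases the monitor; nobody calls notify()
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        seen.put(waiting.getName(), awaitState(waiting, Thread.State.WAITING));

        // TIMED_WAITING (parking): jstack shows "waiting on condition"
        Thread parked = startDaemon("demo-parked", () -> {
            while (true) {
                LockSupport.parkNanos(60_000_000_000L); // park for up to 60 seconds
            }
        });
        seen.put(parked.getName(), awaitState(parked, Thread.State.TIMED_WAITING));
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        for (Map.Entry<String, Thread.State> e : observe().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Running jstack against this process while the threads are alive should show the same three threads with the corresponding headers.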
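An infinite loop or a long-running computation loop consumes CPU and can saturate it. The VM in this example has a single core, so CPU usage reaches 100%. On a multi-core machine, one such thread occupies about 1/n of the total CPU, for example 25% on four cores.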
@RequestMapping("loop")
public void threadLoopDemo() throws Exception {
    int num = 0;
    long start = System.currentTimeMillis() / 1000;
    // Busy loop that logs continuously and keeps one CPU core occupied
    while (true) {
        log.info("====> test Loop");
        num++;
        if (num == Integer.MAX_VALUE) {
            log.info("====> reset num");
            num = 0;
        }
        // Exit after roughly 1000 seconds
        if (System.currentTimeMillis() / 1000 - start > 1000) {
            return;
        }
    }
}
Running top -c shows the CPU usage. It is at 100% here, which indicates that the loop above is monopolizing the CPU.
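ps -mp 3168 -o THREAD,tid,time or top -H -p 3168 (3168 is the process ID) prints the threads of the process together with their IDs and running time. The thread with ID 3187 (the nid in jstack output) is using 82.8% of the CPU and has been running for 2 minutes.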
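In the output of jstack 3168, locate the stack of the problem thread and read it from the bottom up. The Tomcat NioChannel is locked, which means the request thread has not been released and is still executing; the trace also points to the offending line of code. nid=0xc73 converts to decimal 3187, i.e. the thread identified above with high CPU usage and a long running time.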
"http-nio-10015-exec-1" #19 daemon prio=5 os_prio=0 tid=0x00007f75e4050800 nid=0xc73 runnable [0x00007f75e84e4000]
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
- locked <0x00000000eb1a0b48> (a java.io.BufferedOutputStream) // log output being written to the log file
at java.io.PrintStream.write(PrintStream.java:482)
- locked <0x00000000eb18a468> (a java.io.PrintStream)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at ch.qos.logback.core.joran.spi.ConsoleTarget$1.write(ConsoleTarget.java:37)
at ch.qos.logback.core.encoder.LayoutWrappingEncoder.doEncode(LayoutWrappingEncoder.java:131)
.... location of the problem code
at com.ljyhust.demo.web.ThreadTestDemoController.threadLoopDemo(ThreadTestDemoController.java:20)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
..... omitted here
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1520)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1476)
- locked <0x00000000ec476d30> (a org.apache.tomcat.util.net.NioChannel)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
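Matching the thread found by top or ps with a jstack entry requires converting the decimal thread ID to the hexadecimal nid. A small sketch of that conversion follows; the class and method names are made up for this illustration.

```java
class NidConverter {

    /** Converts a decimal OS thread ID (as shown by top -H or ps) to jstack's nid notation. */
    static String toNid(int osThreadId) {
        return "0x" + Integer.toHexString(osThreadId);
    }

    /** Converts a jstack nid such as "0xc73" back to the decimal thread ID. */
    static int fromNid(String nid) {
        return Integer.decode(nid); // decode() understands the 0x prefix
    }

    public static void main(String[] args) {
        System.out.println(toNid(3187));      // prints 0xc73
        System.out.println(fromNid("0xc73")); // prints 3187
    }
}
```

On the command line, printf "%x\n" 3187 gives the same result.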
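A large amount of blocked network IO ties up service threads, so the service cannot respond.
Take the familiar Tomcat as an example. Tomcat has a maximum thread count, maxThreads, and a maximum queue length, acceptCount; both can be configured in server.xml. Tomcat handles an incoming request in one of three ways: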
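1. A request arrives while the number of started or busy threads is below maxThreads: Tomcat starts a thread to handle the request;
2. A request arrives while the number of started or busy threads has reached maxThreads: Tomcat puts the request into the wait queue until a thread becomes free;
3. A request arrives while the thread count has reached maxThreads and the request queue is full: Tomcat rejects the request outright, and the client sees connection refused.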
If many threads are blocked on IO while handling requests, the thread pool fills up and the service cannot respond to new requests.
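The three cases can be modelled with a plain JDK ThreadPoolExecutor. This is only a rough analogy under stated assumptions, not Tomcat's actual implementation: the variables maxThreads and acceptCount borrow Tomcat's setting names, both are set to 1 to keep the example small, and a latch stands in for a blocked IO call.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolDemo {

    /** Submits three slow "requests" and records what happened to each of them. */
    static String[] submitThree() throws InterruptedException {
        int maxThreads = 1;  // assumption: stands in for Tomcat's maxThreads
        int acceptCount = 1; // assumption: stands in for Tomcat's acceptCount
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(acceptCount));

        CountDownLatch release = new CountDownLatch(1);
        Runnable slowRequest = () -> {
            try {
                release.await(); // simulates a request thread blocked on IO
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        String[] outcome = new String[3];
        for (int i = 0; i < outcome.length; i++) {
            try {
                pool.execute(slowRequest);
                // An empty queue means the task went straight to a worker thread
                outcome[i] = pool.getQueue().isEmpty() ? "running" : "queued";
            } catch (RejectedExecutionException e) {
                outcome[i] = "rejected"; // all threads busy and the queue is full
            }
        }

        release.countDown(); // unblock the simulated IO so the pool can drain
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return outcome;
    }

    public static void main(String[] args) throws InterruptedException {
        for (String s : submitThree()) {
            System.out.println(s);
        }
    }
}
```

As long as the first task stays blocked, the second can only wait in the queue and the third is rejected, which mirrors how blocked request threads starve new requests.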
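The blockIo endpoint is the problem code. It calls a third-party application (Google); if that call is slow or blocks, the request thread blocks too. Once enough blocked request threads fill Tomcat's thread pool, the service can no longer respond to incoming requests and may even reject them, leaving the service in a "hung" state. Use jmeter to simulate 3000 concurrent requests, then check whether other endpoints such as getTime still respond normally.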
@RequestMapping("blockIo")
public Object blockIoDemo() throws Exception {
    JSONObject resJson = new JSONObject();
    try {
        // Call to the downstream service; the request thread blocks here if it is slow
        JSONObject resStr = RestClientUtil.getRestTemplate().getForObject("http://10.247.63.25:10015/demo/threadTest/resBlock", JSONObject.class);
        log.info("=====> received text/html {}", resStr);
    } catch (Exception e) {
        e.printStackTrace();
    }
    resJson.put("code", "100");
    return resJson;
}

@RequestMapping("getTime")
public Object getServerTime() throws Exception {
    log.info("=====> request start");
    JSONObject resJson = new JSONObject();
    String format = DateFormatUtils.format(new Date(), "yyyy-MM-dd HH:mm:ss");
    resJson.put("code", "100");
    resJson.put("reqTime", format);
    log.info("=====> request end");
    return resJson;
}
Endpoint response
Calling getTime before and after the load test shows a response time of 32ms before and 12.19s after, so the service's responsiveness has clearly been affected.
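jstack thread stacks
The thread stacks are shown below. There are many TIMED_WAITING threads; following the stack trace leads to at org.apache.http.pool.PoolEntryFuture.await(PoolEntryFuture.java:137), which is Apache HttpClient connection pool code. It shows that the HTTP connection pool is exhausted, leaving many requests waiting for a connection.
Now look at the RUNNABLE thread nid=0xab7. It is executing a request; tracing from the bottom up reaches java.net.SocketInputStream.socketRead(SocketInputStream.java:116), which shows that it is reading from the network.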
// Request waiting in the queue
"http-nio-10015-exec-188" #208 daemon prio=5 os_prio=0 tid=0x0000000001bdb800 nid=0xab9 waiting on condition [0x00007f920f2af000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000e29149c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:256)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2120)
at org.apache.http.pool.PoolEntryFuture.await(PoolEntryFuture.java:137)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:307)
at org.apache.http.pool.AbstractConnPool.access$000(AbstractConnPool.java:65)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:193)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:186)
at org.apache.http.pool.PoolEntryFuture.get(PoolEntryFuture.java:108)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:282)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:269)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
....
at com.ljyhust.demo.web.ThreadTestDemoController.blockIoDemo(ThreadTestDemoController.java:46)
at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:221)
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:136)
at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:110)
at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:832)
at ....
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1520)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1476)
- locked <0x00000000ee823bb0> (a org.apache.tomcat.util.net.NioChannel)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
"http-nio-10015-exec-187" #207 daemon prio=5 os_prio=0 tid=0x0000000001bd9800 nid=0xab8 waiting on condition [0x00007f920f3b0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000e142d7b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkUntil(LockSupport.java:256)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUntil(AbstractQueuedSynchronizer.java:2120)
at org.apache.http.pool.PoolEntryFuture.await(PoolEntryFuture.java:137)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:307)
at org.apache.http.pool.AbstractConnPool.access$000(AbstractConnPool.java:65)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:193)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:186)
at org.apache.http.pool.PoolEntryFuture.get(PoolEntryFuture.java:108)
...
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:88)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:528)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1099)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:670)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1520)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1476)
- locked <0x00000000ee821b30> (a org.apache.tomcat.util.net.NioChannel)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
// Reading from the third-party resource
"http-nio-10015-exec-186" #206 daemon prio=5 os_prio=0 tid=0x0000000001bd7800 nid=0xab7 runnable [0x00007f920f4b1000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
...
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:91)
at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:596)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:557)
at org.springframework.web.client.RestTemplate.getForObject(RestTemplate.java:264)
at com.ljyhust.demo.web.ThreadTestDemoController.blockIoDemo(ThreadTestDemoController.java:46)
at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
...
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1520)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1476)
- locked <0x00000000ec853808> (a org.apache.tomcat.util.net.NioChannel)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
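When a dump contains hundreds of request threads, counting threads per state makes a pattern such as "many TIMED_WAITING threads" easy to spot. The sketch below does this from inside the JVM with the standard ThreadMXBean instead of parsing jstack output; the class and method names are assumptions made for this illustration, and in the demo application such a method would have to be exposed by some endpoint or scheduled task of your own.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

class ThreadStateHistogram {

    /** Counts the live threads of the current JVM grouped by Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: lock details are not needed just to count states
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip entries for threads that have gone away
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<Thread.State, Integer> e : countByState().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

A sudden growth of the TIMED_WAITING count under load is a hint to inspect those stacks for connection pool waits like the ones above.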
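A deadlock arises when multiple threads contend for mutually exclusive resources. For example, threads 1 and 2 hold locks A and B respectively, and inside their critical sections each also needs the other lock (B and A). Because each thread holds the lock the other one needs, the result is a deadlock.
All four of the following conditions must hold for a deadlock to occur:
1. Mutual exclusion: a lock cannot be held by two or more threads at the same time;
2. No preemption: a held lock cannot be taken away by another thread;
3. Hold and wait: a thread already holds one lock and requests or waits for another;
4. Circular wait: a thread waits for a lock held by another thread, which in turn waits for locks held by further threads, and the waits form a cycle.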
@RequestMapping("deadLock")
public Object deadLockDemo() throws Exception {
    log.info("=====> request start");
    JSONObject resJson = new JSONObject();
    // Thread 1 takes lock A, then lock B
    Thread t1 = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                deadLockThreadDemo.getLockAB();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
    // Thread 2 takes lock B, then lock A
    Thread t2 = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                deadLockThreadDemo.getLockBA();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
    t1.start();
    t2.start();
    log.info("=====> request end");
    resJson.put("code", "100");
    return resJson;
}
public void getLockAB() throws Exception {
    // Lock A
    synchronized (objectA) {
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Lock B
        log.info("Thread 1 is trying to acquire lock B");
        synchronized (objectB) {
            log.info("Thread 1 acquired lock B");
        }
    }
}

public void getLockBA() throws Exception {
    // Lock B
    synchronized (objectB) {
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // Lock A
        log.info("Thread 2 is trying to acquire lock A");
        synchronized (objectA) {
            log.info("Thread 2 acquired lock A");
        }
    }
}
Thread-6 first locks object 0x00000000ebe97e18 and then waits for 0x00000000ebe97e08; Thread-5 first locks 0x00000000ebe97e08 and then waits for 0x00000000ebe97e18. The two threads wait on each other, which produces the deadlock, and both are in state BLOCKED (on object monitor).
"Thread-6" #24 daemon prio=5 os_prio=0 tid=0x00007f6ce8ca6800 nid=0xbd7 waiting for monitor entry [0x00007f6d081f8000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.ljyhust.demo.service.DeadLockThreadDemo.getLockBA(DeadLockThreadDemo.java:42)
- waiting to lock <0x00000000ebe97e08> (a java.lang.Object)
- locked <0x00000000ebe97e18> (a java.lang.Object)
at com.ljyhust.demo.web.ThreadTestDemoController$2.run(ThreadTestDemoController.java:89)
at java.lang.Thread.run(Thread.java:748)
"Thread-5" #23 daemon prio=5 os_prio=0 tid=0x00007f6ce8576800 nid=0xbd6 waiting for monitor entry [0x00007f6d1cee3000]
java.lang.Thread.State: BLOCKED (on object monitor)
at com.ljyhust.demo.service.DeadLockThreadDemo.getLockAB(DeadLockThreadDemo.java:26)
- waiting to lock <0x00000000ebe97e18> (a java.lang.Object)
- locked <0x00000000ebe97e08> (a java.lang.Object)
at com.ljyhust.demo.web.ThreadTestDemoController$1.run(ThreadTestDemoController.java:78)
at java.lang.Thread.run(Thread.java:748)
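The same kind of monitor deadlock can also be detected programmatically with the standard ThreadMXBean. The following is a minimal sketch under stated assumptions: the class, method and thread names are invented for this illustration, a latch replaces the Thread.sleep(2000) of the demo so that both threads are guaranteed to hold their first lock before requesting the second, and the threads are daemons so the JVM can still exit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockDetectDemo {

    /** Starts a daemon thread that locks 'first', waits for its peer, then requests 'second'. */
    static void lockInOrder(String name, Object first, Object second, CountDownLatch bothHoldFirst) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await(); // proceed only once both threads hold their first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the peer thread holds 'second'
                }
            }
        }, name);
        t.setDaemon(true);
        t.start();
    }

    /** Creates a two-thread deadlock and returns how many deadlocked threads the JVM reports. */
    static int detect() throws InterruptedException {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        lockInOrder("demo-AB", a, b, bothHoldFirst);
        lockInOrder("demo-BA", b, a, bothHoldFirst);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + 5000;
        long[] ids = mx.findMonitorDeadlockedThreads(); // null while no deadlock exists
        while (ids == null && System.currentTimeMillis() < deadline) {
            Thread.sleep(20);
            ids = mx.findMonitorDeadlockedThreads();
        }
        if (ids == null) {
            return 0;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) {
                System.out.println(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return ids.length;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + detect());
    }
}
```

Each printed line corresponds to one "waiting to lock" / "locked" pair in the jstack output above.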
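Whether the symptom is a CPU spike or slow responses, when the logs reveal nothing or no logs are printed at all, printing the thread stacks with jstack and combining them with top -H -p may lead you to the cause. This is especially true when the service calls many other resources and it is unclear which one is at fault; give the command a try.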