During a recent online flash-sale event, our application's traffic surged to tens or even hundreds of times its normal concurrency, and a hot endpoint started timing out en masse. This post records the troubleshooting process for future reference, and hopefully offers some useful takeaways.
While investigating the timeouts, we noticed a classic symptom: the timing-out endpoint itself does nothing but forwarding, and the remote interface it calls returns almost instantly. The nginx and application logs showed that at peak concurrency, the endpoint took 7-8 seconds or longer from receiving a request to returning a response (request 1 in the figure), while the remote interface it depends on internally responded in milliseconds (request 2 in the figure); responses 5 and 6 occurred almost simultaneously.
So the problem boiled down to finding out why the gateway, after receiving request 1, did not promptly issue the HTTP request to serviceA. My first suspect was RestTemplate: was it causing requests to pile up? To verify the hypothesis we needed more data, so we reproduced the timeouts with a load test in the test environment, captured the gateway's stacks under high concurrency with the jstack command, and uploaded the thread stack dump file to gceasy (fastThread) to analyze the active threads at that moment:
Impressively, the site's ML analysis straight away flagged that the application was unresponsive, and it turned out to be spot on.
Next, a few key metrics. The thread state distribution showed a large number of threads in the WAITING state.
Grouping by thread pool, the largest group came from HTTP request handling; yet of the 575 http threads, only a pitiful few were actually RUNNABLE, while the vast majority sat in WAITING or TIMED_WAITING, consistent with the application's symptoms.
Among these, 497 WAITING threads shared a completely identical stack trace.
The path forward was clear: look at what these WAITING http threads were doing, and find out what was stopping RestTemplate from issuing HTTP requests in time.
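As an aside, thread states and stacks can also be inspected from inside the JVM when jstack is unavailable. The sketch below (class name is my own, stdlib only) prints each live thread's state and its top frames:

```java
// Print each live thread's state plus its top stack frames -- a programmatic
// stand-in for a jstack dump when you only need a quick look. Illustrative sketch.
import java.util.Map;

public class StackPeek {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : all.entrySet()) {
            System.out.println(e.getKey().getName() + " : " + e.getKey().getState());
            StackTraceElement[] frames = e.getValue();
            for (int i = 0; i < Math.min(3, frames.length); i++) {
                System.out.println("    at " + frames[i]);
            }
        }
    }
}
```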
Inspecting their stack traces showed that these WAITING threads were exactly the ones executing RestTemplate calls, and that internally they all use Apache HttpClient to create connections. The frame PoolingHttpClientConnectionManager.leaseConnection reveals that a large number of threads were trying to lease a connection from the connection pool; the lease is mediated by an AQS, and the threads were blocked waiting on it:
http-nio-8080-exec-571
PRIORITY : 5
THREAD ID : 0X00007F2A20263000
NATIVE ID : 0X52AA
NATIVE ID (DECIMAL) : 21162
STATE : WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000662488168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)
at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246)
- locked <0x00000007595108c0> (a org.apache.http.pool.AbstractConnPool$2)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:240)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:227)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:173)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:89)
at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:659)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:620)
at org.springframework.web.client.RestTemplate.postForEntity(RestTemplate.java:414)
(remainder omitted)
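The parked AQS frames above can be reproduced in miniature. In this sketch (my own class, stdlib only), a Semaphore — which, like the httpclient pool, is built on AbstractQueuedSynchronizer — stands in for a 5-connection pool; once all permits are leased, the next thread parks in exactly the WAITING state seen in the dump:

```java
// A 5-permit Semaphore stands in for a connection pool with maxPerRoute = 5.
// Once all "connections" are leased, an extra thread parks inside AQS and shows
// up as WAITING, just like the http-nio threads in the dump. Illustrative only.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

public class PoolWaitDemo {
    public static void main(String[] args) throws InterruptedException {
        final Semaphore pool = new Semaphore(5);   // stand-in for maxPerRoute = 5
        pool.acquire(5);                           // all "connections" are leased
        CountDownLatch started = new CountDownLatch(1);
        Thread waiter = new Thread(() -> {
            started.countDown();
            try {
                pool.acquire();                    // parks: no permit available
            } catch (InterruptedException ignored) { }
        });
        waiter.start();
        started.await();
        Thread.sleep(100);                         // let the waiter reach park()
        System.out.println(waiter.getState());     // typically WAITING, as in the dump
        waiter.interrupt();
        waiter.join();
    }
}
```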
At this point the cause was essentially located: it had to be related to HttpClient's connection pool. Here is how the restTemplate bean was defined in the code:
@Bean
public RestTemplate restTemplate() {
    HttpComponentsClientHttpRequestFactory requestFactory = new HttpComponentsClientHttpRequestFactory();
    requestFactory.setConnectTimeout(3000);
    requestFactory.setReadTimeout(5000);
    RestTemplate restTemplate = new RestTemplate(requestFactory);
    restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(StandardCharsets.UTF_8));
    return restTemplate;
}
HttpComponentsClientHttpRequestFactory is responsible for creating the httpClient. Looking at the source, if nothing is explicitly specified, it falls back to the system configuration:
if (systemProperties) {
    String s = System.getProperty("http.keepAlive", "true");
    if ("true".equalsIgnoreCase(s)) {
        s = System.getProperty("http.maxConnections", "5");
        final int max = Integer.parseInt(s);
        poolingmgr.setDefaultMaxPerRoute(max);
        poolingmgr.setMaxTotal(2 * max);
    }
}
Here maxTotal is the overall size of the connection pool, and maxPerRoute caps the number of connections per route. In other words, for HTTP requests to the same host, httpClient will open at most 5 concurrent connections; no wonder so many threads were left waiting. Going back to the thread dump, we can confirm that only 5 connections were actually established: searching all 575 http threads for the key frame org.apache.http.impl.io.SessionInputBufferImpl.streamRead showed that indeed exactly 5 RUNNABLE threads were reading from sockets.
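To make the arithmetic concrete, this tiny sketch mirrors the defaulting branch quoted above (the class and method names are mine; http.keepAlive and http.maxConnections are the real system properties the library reads):

```java
// Mirrors the system-property fallback quoted from HttpClientBuilder: with
// http.maxConnections unset, the pool defaults to 5 connections per route
// and 10 in total. Illustrative sketch, not the library code itself.
public class PoolDefaults {
    static int[] systemDefaults() {
        int max = Integer.parseInt(System.getProperty("http.maxConnections", "5"));
        return new int[] { max, 2 * max };   // { defaultMaxPerRoute, maxTotal }
    }
    public static void main(String[] args) {
        int[] d = systemDefaults();
        System.out.println("defaultMaxPerRoute=" + d[0] + ", maxTotal=" + d[1]);
        // -> defaultMaxPerRoute=5, maxTotal=10
    }
}
```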
The fix is then straightforward: give the httpClient used by RestTemplate a properly sized connection pool:
@Bean
public RestTemplate restTemplate(HttpClient httpClient) {
    HttpComponentsClientHttpRequestFactory requestFactory = new HttpComponentsClientHttpRequestFactory();
    requestFactory.setHttpClient(httpClient);
    RestTemplate restTemplate = new RestTemplate(requestFactory);
    restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(StandardCharsets.UTF_8));
    return restTemplate;
}

@Bean
public HttpClient httpClient(PoolingHttpClientConnectionManager connectionManager) {
    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectTimeout(3000)
            .setSocketTimeout(5000)
            .setConnectionRequestTimeout(1000)
            .build();
    return HttpClientBuilder.create()
            .setConnectionManager(connectionManager)
            .setDefaultRequestConfig(requestConfig)
            .build();
}

@Bean(destroyMethod = "close")
public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
    PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
    // total pool size
    connectionManager.setMaxTotal(1000);
    // max connections per route
    connectionManager.setDefaultMaxPerRoute(200);
    return connectionManager;
}
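One detail worth calling out in the config above is setConnectionRequestTimeout(1000): when the pool is exhausted, a lease attempt now fails after one second instead of parking indefinitely. The semantics can be sketched with a timed tryAcquire (an illustrative class of my own, not httpclient internals):

```java
// Sketch of connectionRequestTimeout semantics: a timed tryAcquire on a
// bounded "pool" gives up after the deadline instead of parking forever.
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class LeaseTimeoutDemo {
    static boolean lease(Semaphore pool, long timeoutMillis) throws InterruptedException {
        return pool.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
    }
    public static void main(String[] args) throws InterruptedException {
        Semaphore pool = new Semaphore(1);
        System.out.println(lease(pool, 100));  // true: a connection is free
        System.out.println(lease(pool, 100));  // false: pool exhausted, fail fast after ~100ms
    }
}
```

Failing fast turns a silent pile-up into a visible error, which is usually preferable under sudden load spikes.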
After the change, another load test showed that nearly all of the gateway's http threads were now in the RUNNABLE state. Drilling into their stack traces, these threads were all busy executing socketRead, meaning the pressure had now shifted downstream to ServiceA:
http-nio-8080-exec-102
PRIORITY : 5
THREAD ID : 0X00007F7EE40E1000
NATIVE ID : 0X4B17
NATIVE ID (DECIMAL) : 19223
STATE : RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
(remainder omitted)
But the load test still showed a large number of timeouts on this endpoint, and by now the logs indicated that the timeouts came from the interface provided by ServiceA.
The only difference between ServiceA->ServiceB and gateway->ServiceA is the HTTP client used: since this is a Spring Cloud stack, ServiceA calls downstream through feign, so we had good reason to suspect that feign was causing the same request pile-up. Applying the same approach used on the gateway, we saw that under high concurrency most of ServiceA's http threads were in the WAITING state with only 5 RUNNABLE; at this point the answer was all but confirmed.
Checking the stacks of these WAITING threads confirmed it: they, too, were waiting to lease a connection from the pool:
http-nio-8080-exec-294
PRIORITY : 5
THREAD ID : 0X00007F5F3B119800
NATIVE ID : 0X1C9B
NATIVE ID (DECIMAL) : 7323
STATE : WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007a8617d80> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.PoolEntryFuture.await(PoolEntryFuture.java:138)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:306)
at org.apache.http.pool.AbstractConnPool.access$000(AbstractConnPool.java:64)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:192)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:185)
at org.apache.http.pool.PoolEntryFuture.get(PoolEntryFuture.java:107)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:276)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at feign.httpclient.ApacheHttpClient.execute(ApacheHttpClient.java:87)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:92)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:77)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:287)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:164)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:77)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:48)
at com.netflix.client.AbstractLoadBalancerAwareClient$1.call(AbstractLoadBalancerAwareClient.java:109)
at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:303)
at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:287)
... (middle omitted) ...
at com.netflix.client.AbstractLoadBalancerAwareClient.executeWithLoadBalancer(AbstractLoadBalancerAwareClient.java:117)
at org.springframework.cloud.netflix.feign.ribbon.LoadBalancerFeignClient.execute(LoadBalancerFeignClient.java:63)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:97)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:76)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:103)
at com.sun.proxy.$Proxy199.<feign service method>(Unknown Source)
(remainder omitted)
The fix is the same as for RestTemplate: explicitly configure the connection pool of the httpClient that Feign uses:
@Bean
public HttpClient httpClient(PoolingHttpClientConnectionManager connectionManager) {
    RequestConfig requestConfig = RequestConfig.custom()
            .setConnectTimeout(3000)
            .setSocketTimeout(5000)
            .setConnectionRequestTimeout(1000)
            .build();
    return HttpClientBuilder.create()
            .setConnectionManager(connectionManager)
            .setDefaultRequestConfig(requestConfig)
            .build();
}

@Bean(destroyMethod = "close")
public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
    PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
    // total pool size
    connectionManager.setMaxTotal(1000);
    // max connections per route
    connectionManager.setDefaultMaxPerRoute(200);
    return connectionManager;
}
With that, the problem was fully resolved: under a 100-thread load test, the endpoint's throughput rose from 5-6 QPS to 100 QPS.
One final note. RestTemplate and Feign, as HTTP call tools, both default to the JDK's built-in HttpURLConnection when nothing is configured, and HttpURLConnection has no configurable connection pool (only a small per-host keep-alive cache), so it cannot cope with high-concurrency scenarios. The usual replacements are Apache HttpClient or okHttp. The applications in this post all use Apache HttpClient; for applications on okHttp, configuring the connection pool under high concurrency is just as essential.
Also, for Feign, to use Apache HttpClient you only need to add the corresponding dependency and it will automatically replace HttpURLConnection:

<dependency>
    <groupId>io.github.openfeign</groupId>
    <artifactId>feign-httpclient</artifactId>
    <version>9.5.0</version>
</dependency>
To use okHttp, besides adding the corresponding dependency, you also need to flip the related configuration switches:

<dependency>
    <groupId>io.github.openfeign</groupId>
    <artifactId>feign-okhttp</artifactId>
    <version>10.1.0</version>
</dependency>

feign:
  httpclient:
    enabled: false
  okhttp:
    enabled: true