A High-Concurrency Tuning Case: httpClient

We recently ran a large online promotion and traffic surged: instantaneous concurrency hit tens to hundreds of times the usual level, and a hot interface started timing out en masse. This post records the investigation, both for future reference and in the hope that it gives others some ideas.

While investigating the timeouts we hit a classic symptom: the timing-out interface itself only does forwarding, yet the remote interface it calls returns almost instantly. The nginx and application logs showed that at peak concurrency the timing-out interface took 7-8 seconds or more from receiving a request to returning a response (request 1 in the diagram below), while the remote interface it depends on responded in milliseconds (request 2), with responses 5 and 6 occurring almost simultaneously.

Call sequence diagram
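A quick sanity check on where the time goes: if the downstream handler runs in milliseconds but the end-to-end latency is seconds, essentially all of it is spent inside the gateway before the outbound call is ever made. A trivial sketch of that arithmetic, with illustrative numbers in the same ballpark as the logs above (not exact measurements):

```java
// Back-of-the-envelope check: almost all of the latency lives in the
// gateway, not downstream. The figures are illustrative, not measured.
public class QueueingGap {
    static long gatewayWaitMs(long endToEndMs, long downstreamMs) {
        return endToEndMs - downstreamMs;
    }

    public static void main(String[] args) {
        // request 1 took ~8 s end to end; serviceA answered in ~20 ms
        System.out.println("time spent inside the gateway: "
                + gatewayWaitMs(8000, 20) + " ms"); // -> 7980 ms
    }
}
```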

The problem therefore reduces to finding out why the gateway, after receiving request 1, did not promptly issue the HTTP request to serviceA. My first suspect was RestTemplate: was it causing requests to pile up? To verify the hypothesis we needed more data, so we reproduced the timeouts with a load test in the test environment, captured the gateway's stack under high concurrency with the jstack command, and uploaded the thread dump file to gceasy (fastThread) to analyze the active threads:

Impressively, the site's ML analysis flagged outright that the application was unresponsive, and it turned out to be exactly right.

On to the key metrics. First, the thread-state distribution: a large majority of threads were in the WAITING state:

Thread states

Grouping threads by name showed that most of them were HTTP worker threads, yet of the 575 HTTP threads only a pitiful handful were actually RUNNABLE; the vast majority were WAITING or TIMED_WAITING, which matches the application's symptoms:

Thread groups

Among them, 497 WAITING threads shared an identical stack trace:

Threads with identical stacks

The plan was now clear: look at what these WAITING HTTP threads were doing, and find out what was stopping RestTemplate from issuing HTTP requests in time.

Inspecting the stack traces showed that these WAITING threads were precisely the ones executing RestTemplate, which internally uses Apache httpClient to create connections. The PoolingHttpClientConnectionManager.leaseConnection frame reveals that large numbers of threads were trying to lease a connection from the pool; the pool hands out connections through an AQS, and these threads were all blocked waiting on it:

http-nio-8080-exec-571
PRIORITY : 5

THREAD ID : 0X00007F2A20263000

NATIVE ID : 0X52AA

NATIVE ID (DECIMAL) : 21162

STATE : WAITING


stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000662488168> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380)
at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246)
- locked <0x00000007595108c0> (a org.apache.http.pool.AbstractConnPool$2)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:240)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:227)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:173)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:89)
at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:659)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:620)
at org.springframework.web.client.RestTemplate.postForEntity(RestTemplate.java:414)
(truncated)

That pins the cause on the httpClient connection pool. The restTemplate was defined in code like this:

    @Bean
    public RestTemplate restTemplate() {
        HttpComponentsClientHttpRequestFactory requestFactory =
                new HttpComponentsClientHttpRequestFactory();
        requestFactory.setConnectTimeout(3000);
        requestFactory.setReadTimeout(5000);
        RestTemplate restTemplate = new RestTemplate(requestFactory);
        restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(StandardCharsets.UTF_8));
        return restTemplate;
    }

The HttpComponentsClientHttpRequestFactory class is responsible for creating the httpClient. Reading the source shows that, unless explicitly configured, it falls back to system properties:

  if (systemProperties) {
      String s = System.getProperty("http.keepAlive", "true");
      if ("true".equalsIgnoreCase(s)) {
          s = System.getProperty("http.maxConnections", "5");
          final int max = Integer.parseInt(s);
          poolingmgr.setDefaultMaxPerRoute(max);
          poolingmgr.setMaxTotal(2 * max);
      }
  }
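The derivation above can be reproduced with nothing but the JDK system-property lookups the builder performs. A standalone sketch (the property names are the real JDK ones; the class itself is just for illustration):

```java
// Mirrors the pool-size derivation in HttpClientBuilder's
// system-properties mode: per-route limit from http.maxConnections,
// total limit at twice that.
public class PoolDefaults {
    static int maxPerRoute() {
        return Integer.parseInt(System.getProperty("http.maxConnections", "5"));
    }

    static int maxTotal() {
        return 2 * maxPerRoute();
    }

    public static void main(String[] args) {
        System.out.println("maxPerRoute=" + maxPerRoute()
                + ", maxTotal=" + maxTotal());
    }
}
```

Assuming no `-Dhttp.maxConnections` override is set, this prints `maxPerRoute=5, maxTotal=10` -- the defaults our gateway was silently running with.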

Here maxTotal is the overall pool size, and maxPerRoute is the per-route connection limit: for HTTP requests to the same host, httpClient will open at most 5 concurrent connections. No wonder so many threads were stuck waiting. We can now go back to the thread dump to confirm that only 5 connections were actually established.

Searching all 575 HTTP threads for the frame org.apache.http.impl.io.SessionInputBufferImpl.streamRead confirmed that exactly 5 RUNNABLE threads were performing socket I/O.

The 5 threads that obtained a connection
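The blocking behavior in the dump can be modeled with a plain Semaphore standing in for the per-route pool. This is a hypothetical simulation of the lease semantics, not httpClient's actual implementation:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the default per-route limit: a request must
// lease a permit (a connection) before it can do any socket I/O.
public class LeaseSimulation {
    public static void main(String[] args) throws InterruptedException {
        Semaphore perRoute = new Semaphore(5); // http.maxConnections default

        // 5 in-flight requests hold every permit...
        for (int i = 0; i < 5; i++) {
            perRoute.acquire();
        }

        // ...so a 6th caller cannot lease a connection and parks, just like
        // the threads stuck in AbstractConnPool.getPoolEntryBlocking above.
        boolean leased = perRoute.tryAcquire(100, TimeUnit.MILLISECONDS);
        System.out.println("6th request got a connection: " + leased); // -> false
    }
}
```

With 497 callers contending for 5 permits, almost everyone parks, which is exactly the thread-state distribution fastThread reported.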

The fix is then obvious: give the httpClient behind RestTemplate a properly sized connection pool:

    @Bean
    public RestTemplate restTemplate(HttpClient httpClient) {
        HttpComponentsClientHttpRequestFactory requestFactory = new
                HttpComponentsClientHttpRequestFactory();
        requestFactory.setHttpClient(httpClient);
        RestTemplate restTemplate = new RestTemplate(requestFactory);
        restTemplate.getMessageConverters().add(0, new StringHttpMessageConverter(StandardCharsets.UTF_8));
        return restTemplate;
    }

    @Bean
    public HttpClient httpClient(PoolingHttpClientConnectionManager connectionManager){
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(3000)
                .setSocketTimeout(5000).setConnectionRequestTimeout(1000).build();
        return HttpClientBuilder.create()
                .setConnectionManager(connectionManager)
                .setDefaultRequestConfig(requestConfig)
                .build();
    }

    @Bean(destroyMethod = "close")
    public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
        PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
        // total pool size
        connectionManager.setMaxTotal(1000);
        // max connections per route
        connectionManager.setDefaultMaxPerRoute(200);
        return connectionManager;
    }

After the change, another load test showed that nearly all of the gateway's HTTP threads were now RUNNABLE:

HTTP threads all RUNNABLE

Drilling into their stack traces, these threads were all executing socketRead, meaning the pressure had moved downstream to serviceA:

http-nio-8080-exec-102
PRIORITY : 5

THREAD ID : 0X00007F7EE40E1000

NATIVE ID : 0X4B17

NATIVE ID (DECIMAL) : 19223

STATE : RUNNABLE


stackTrace:
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
(truncated)

The load test, however, still showed plenty of timeouts on the interface, and the logs showed that the calls now timing out were the ones to the interface exposed by ServiceA:

serviceA interface timing out

The only difference between the ServiceA->ServiceB call and the gateway->ServiceA call is the HTTP client used: since this is a Spring Cloud stack, ServiceA calls out through Feign, so we had every reason to suspect Feign was causing the same backlog. Applying the same method used on the gateway, we found that under high concurrency ServiceA likewise had masses of HTTP threads in WAITING with only 5 RUNNABLE; at this point the answer was all but confirmed.

feign-httpclient-before-thread-group.png

Checking the stacks of those WAITING threads confirmed it: they too were waiting to lease a connection from the pool:

http-nio-8080-exec-294
PRIORITY : 5

THREAD ID : 0X00007F5F3B119800

NATIVE ID : 0X1C9B

NATIVE ID (DECIMAL) : 7323

STATE : WAITING


stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007a8617d80> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.PoolEntryFuture.await(PoolEntryFuture.java:138)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:306)
at org.apache.http.pool.AbstractConnPool.access$000(AbstractConnPool.java:64)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:192)
at org.apache.http.pool.AbstractConnPool$2.getPoolEntry(AbstractConnPool.java:185)
at org.apache.http.pool.PoolEntryFuture.get(PoolEntryFuture.java:107)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:276)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at feign.httpclient.ApacheHttpClient.execute(ApacheHttpClient.java:87)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:92)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer$1.doWithRetry(RetryableFeignLoadBalancer.java:77)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:287)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:164)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:77)
at org.springframework.cloud.netflix.feign.ribbon.RetryableFeignLoadBalancer.execute(RetryableFeignLoadBalancer.java:48)
at com.netflix.client.AbstractLoadBalancerAwareClient$1.call(AbstractLoadBalancerAwareClient.java:109)
at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:303)
at com.netflix.loadbalancer.reactive.LoadBalancerCommand$3$1.call(LoadBalancerCommand.java:287)
... (intermediate frames omitted) ...
com.netflix.client.AbstractLoadBalancerAwareClient.executeWithLoadBalancer(AbstractLoadBalancerAwareClient.java:117)
at org.springframework.cloud.netflix.feign.ribbon.LoadBalancerFeignClient.execute(LoadBalancerFeignClient.java:63)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:97)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:76)
at feign.ReflectiveFeign$FeignInvocationHandler.invoke(ReflectiveFeign.java:103)
at com.sun.proxy.$Proxy199.<microservice method name>(Unknown Source)
(truncated)

The fix is the same as for RestTemplate: explicitly configure the connection pool of the httpClient that Feign uses:

    @Bean
    public HttpClient httpClient(PoolingHttpClientConnectionManager connectionManager){
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(3000)
                .setSocketTimeout(5000).setConnectionRequestTimeout(1000).build();
        return HttpClientBuilder.create()
                .setConnectionManager(connectionManager)
                .setDefaultRequestConfig(requestConfig)
                .build();
    }

    @Bean(destroyMethod = "close")
    public PoolingHttpClientConnectionManager poolingHttpClientConnectionManager() {
        PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
        // total pool size
        connectionManager.setMaxTotal(1000);
        // max connections per route
        connectionManager.setDefaultMaxPerRoute(200);
        return connectionManager;
    }

With that, the problem was fully resolved: under a 100-thread load test, the interface went from 5-6 QPS to 100 QPS.
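As for why 1000/200 rather than some other numbers: a rough Little's-law estimate is a reasonable starting point, connections needed ≈ target throughput × average downstream latency, plus headroom for spikes and retries. The figures below are illustrative assumptions, not our measured values:

```java
// Little's-law sizing sketch: concurrent connections ~= QPS * latency.
// Illustrative numbers only -- plug in your own measurements.
public class PoolSizing {
    static int requiredConnections(double qps, double avgLatencySeconds) {
        return (int) Math.ceil(qps * avgLatencySeconds);
    }

    public static void main(String[] args) {
        // e.g. 100 QPS against a downstream averaging ~250 ms per call
        System.out.println(requiredConnections(100, 0.25) + " connections"); // -> 25
        // pick maxPerRoute comfortably above this to absorb bursts
    }
}
```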

One final note: if nothing is configured, both RestTemplate and Feign actually default to the JDK's built-in HttpURLConnection as their HTTP client, which offers no configurable connection pool and cannot cope with high-concurrency scenarios. The usual replacements are Apache httpClient or okHttp. The applications in this post all use Apache httpClient; for applications on okHttp, configuring the connection pool under high concurrency is just as essential.

Also, to switch Feign to Apache httpClient, you only need to add the corresponding dependency; it then replaces HttpURLConnection automatically:

  <dependency>
    <groupId>io.github.openfeign</groupId>
    <artifactId>feign-httpclient</artifactId>
    <version>9.5.0</version>
  </dependency>

To use okHttp instead, besides adding the dependency you also need to turn on the relevant configuration switches:

  <dependency>
    <groupId>io.github.openfeign</groupId>
    <artifactId>feign-okhttp</artifactId>
    <version>10.1.0</version>
  </dependency>

feign:
  httpclient:
    enabled: false
  okhttp:
    enabled: true
