Apache HttpClient使用不当导致的请求超时问题排查

一、背景

近期负责的线上应用出现调用失败的问题，排查后发现是HttpClient相关的问题，特此记录一下。

因为涉及线上数据安全隐私，模拟生产的应用写了demo，模拟一定的并发请求复现该问题。

1 问题介绍

收到线上告警通知：外部服务调用某应用A大量报错，平均响应时常很长，经常有请求超时的情况。

二、问题排查

1 初步排查

1.1 查连接情况

外部请求是http请求，所以可以查一下连接的情况。对应 java 进程的连接数：

具体的连接情况：

202207311129431.png

[root@sky ~]# netstat -tanp | grep 8433 | grep 8080 | wc -l
400

应用的 8080 端口有400个连接。

1.2 查线程情况

平均响应时常比较长，一般是线程长时间在等待某个步骤完成。查一下应用的线程执行情况：

执行命令 jstack 8433 | less

202207311117556.png

jstack的日志中有很多类似上面图中的线程，处于 TIME_WAITING状态，而下面的堆栈中，HttpClientUtils.httpGet()是使用 Apache HttpClient 工具封装的 http 请求函数，PoolHttpClientConnectionManager.leaseConnection() 看上去是在等待获取连接。大概分析得出：有很多线程在执行过程中，等待HttpClient获取连接。

1.3 查应用异常日志

查一下应用的日志，寻找线索，在日志中看到许多下面的异常日志：

2022-07-24 21:47:44.765 ERROR 82381 --- [pool-2-thread-7] com.skyme.TestController                 : error:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[httpclient-4.5.2.jar:4.5.2]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[httpclient-4.5.2.jar:4.5.2]
    at com.skyme.HttpClientUtils.httpGet(HttpClientUtils.java:31) ~[classes/:na]
    at com.skyme.TestController.lambda$batchHttpCall$0(TestController.java:47) ~[classes/:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_73]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_73]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_73]

1.4 初步结论

外部有大量请求，很多请求的处理线程在等待 HttpClient 获取连接，问题应该在 HttpClient 。

2 进一步排查

2.1 分析 HttpClient

日志里两个关键信息：

1.是HttpClient内部报错

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286) ~[httpclient-4.5.2.jar:4.5.2]

对应的源码org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)：

protected HttpClientConnection leaseConnection(
        final Future future,
        final long timeout,
        final TimeUnit tunit) throws InterruptedException, ExecutionException, ConnectionPoolTimeoutException {
    final CPoolEntry entry;
    try {
        entry = future.get(timeout, tunit); // 该处会抛出TimeoutException
        if (entry == null || future.isCancelled()) {
            throw new InterruptedException();
        }
        Asserts.check(entry.getConnection() != null, "Pool entry with no connection");
        if (this.log.isDebugEnabled()) {
            this.log.debug("Connection leased: " + format(entry) + formatStats(entry.getRoute()));
        }
        return CPoolProxy.newProxy(entry);
    } catch (final TimeoutException ex) {
        // 捕获TimeoutException异常，再抛出ConnectionPoolTimeoutException
        throw new ConnectionPoolTimeoutException("Timeout waiting for connection from pool");
    }
}

抛出ConnectionPoolTimeoutException是因为触发了TimeoutException异常，而TimeoutException异常对应于entry = future.get(timeout, tunit);

PoolingHttpClientConnectionManager.leaseConnection 是为了获取连接，分析一下其执行流程，摘录一下网上博客（https://segmentfault.com/a/1190000012343705）获取connection时序图：

f3927e5b5ad94a949331a28870d49a95_tplv-k3u1fbpfcp-watermark.png

重点的步骤：

1.如果route下有空闲连接，则直接返回空闲连接

2.如果没有空闲连接，且该connPerRoute及池中conn数均没有达到最大值，的conn数没有达到最大值，则创建连接，并放入池中，并返回该连接

3.如果没有空闲连接，且达到了maxRoute或maxPoolsize，则阻塞等待，等待的时常，便是entry = future.get(timeout, tunit);中的 timeout 时长。如果指定的时间没有 lease 到 connection ，则 entry = future.get(timeout, tunit); 会抛出 TimeoutException 异常。

一般 pool 设计都是和上面的类似，我们继续分析一下 httpclient pool 的结构：

图片摘录自

6d5ff0d006a2437fa2703d0909cf4198_tplv-k3u1fbpfcp-watermark.png

PoolEntry: 路由和连接的对应关系。

routeToPool: 可以多个（图中仅示例两个），图中各队列大小动态变化，并不相等。

maxTotal: 限制的是外层 httpConnPool 中 leased 集合和 available 队列的总和的大小，httpConnPool 的 leased 和 available 的大小没有单独限制。

maxPerRoute: 限制的是每个 routeToPool 中 leased 集合和 available 队列的总和的大小。

一步步跟踪源码entry = future.get(timeout, tunit);，定位到核心调用处 org.apache.http.pool.AbstractConnPool#getPoolEntryBlocking() ，对应源码

private E getPoolEntryBlocking(
        final T route, final Object state,
        final long timeout, final TimeUnit tunit,
        final PoolEntryFuture future)
            throws IOException, InterruptedException, TimeoutException {

    Date deadline = null; // 阻塞等待的截止时间
    if (timeout > 0) { 
        deadline = new Date
            (System.currentTimeMillis() + tunit.toMillis(timeout));
    }
    // 首先对当前连接池加锁，当前锁是可重入锁ReentrantLockthis.lock.lock();
    this.lock.lock();
    try {
        // 获得一个当前HttpRoute对应的连接池，即routeToPool
        final RouteSpecificPool pool = getPool(route);
        E entry = null;
        while (entry == null) {
            Asserts.check(!this.isShutDown, "Connection pool shut down");
            for (;;) {
                entry = pool.getFree(state);
                if (entry == null) {
                    break;
                }
                // 如果拿到的是过期连接或者已关闭连接，就释放资源，继续循环获取
                if (entry.isExpired(System.currentTimeMillis())) {
                    entry.close();
                } else if (this.validateAfterInactivity > 0) {
                    if (entry.getUpdated() + this.validateAfterInactivity <= System.currentTimeMillis()) {
                        if (!validate(entry)) {
                            entry.close();
                        }
                    }
                }
                if (entry.isClosed()) {
                    this.available.remove(entry);
                    pool.free(entry, false);
                } else {
                    break;
                }
            }
            // 拿到未在使用中的(available)有效连接就退出循环，直接return对应连接
            if (entry != null) {
                this.available.remove(entry);
                this.leased.add(entry);
                onReuse(entry);
                return entry;
            }

            // New connection is needed。执行到这说明没有拿到连接，需要新建连接
            final int maxPerRoute = getMax(route);
            // Shrink the pool prior to allocating a new connection
            // pool.getAllocatedCount()=pool.available.size() + pool.leased.size()，可用的加正在租借的即当前池中已有的连接
            // 如果route对应的连接池的已有连接数超过了支持的最大连接数量(加1的目的？)，则通过LRU清理掉一些连接。
            final int excess = Math.max(0, pool.getAllocatedCount() + 1 - maxPerRoute);
            if (excess > 0) {
                for (int i = 0; i < excess; i++) {
                    final E lastUsed = pool.getLastUsed();
                    if (lastUsed == null) {
                        break;
                    }
                    lastUsed.close();
                    this.available.remove(lastUsed);
                    pool.remove(lastUsed);
                }
            }

            // 如果route对应的连接池的已有连接数未达到最大最大限制
            if (pool.getAllocatedCount() < maxPerRoute) {
                final int totalUsed = this.leased.size();
                // freeCapacity指大池子里除了已用连接，最多能容纳的空闲连接数(包括已经创建的available连接和后续还允许创建的available连接)
                final int freeCapacity = Math.max(this.maxTotal - totalUsed, 0);
                if (freeCapacity > 0) {
                    final int totalAvailable = this.available.size();
                    // 此时当前route肯定leased已满，无available，若大池子里已达到上限，且大池子里还有available的空闲连接，那可以清理其他route下的无available连接，为新连接挪出位置。所以close的是otherpool中lastUsed的available连接。若大池子没有达到上限，则没必要做清理。
                    if (totalAvailable > freeCapacity - 1) {
                        if (!this.available.isEmpty()) {
                            final E lastUsed = this.available.removeLast();
                            lastUsed.close();
                            final RouteSpecificPool otherpool = getPool(lastUsed.getRoute());
                            otherpool.remove(lastUsed);
                        }
                    }
                    // 创建一个新连接
                    final C conn = this.connFactory.create(route);
                    // 放入route对应的池子的leased集合
                    entry = pool.add(conn);
                    // 放入大池子的leased集合
                    this.leased.add(entry);
                    // reutrn该连接
                    return entry;
                }
            }

            // 到这里证明没有从获得route池中获得有效连接，并且想要自己建立连接时当前route连接池已经到达最大值，即已经有连接在使用，但是对当前线程不可用
            boolean success = false;
            try {
                // 将future放入route池中等待，实际就是加入route池的pending集合
                pool.queue(future);
                // 放入大池子的pending集合
                this.pending.add(future);
                // 如果等待到了信号量的通知,success为true。如果到截止时间还未等到，则success为false
                success = future.await(deadline);
            } finally {
                // In case of 'success', we were woken up by the
                // connection pool and should now have a connection
                // waiting for us, or else we're shutting down.
                // Just continue in the loop, both cases are checked.
                // fanally，从等待队列中移除
                pool.unqueue(future);
                this.pending.remove(future);
            }
            // check for spurious wakeup vs. timeout
            // 如果没有等到信号量通知并且当前时间已经超时，则退出循环
            if (!success && (deadline != null) &&
                (deadline.getTime() <= System.currentTimeMillis())) {
                break;
            }
        }
        // 最终也没有等到信号量通知，没有拿到可用连接，则抛异常
        throw new TimeoutException("Timeout waiting for connection");
    } finally {
        // 释放对大池子的锁
        this.lock.unlock();
    }
}

综合分析：

通过线程池获取连接要通过 ReetrantLock 加锁，保证线程安全
不论是大连接池还是小连接池，当超过数量的时候，都要通过LRU释放一些连接
如果拿到了可用连接，则返回给上层使用
如果没有拿到可用连接，HttpClient 会判断当前 route 连接池是否已经超过了最大数量，没有到上限就会新建一个连接，并放入池中
如果到达了上限，就排队等待，等到了信号量，就重新获得一次，等待不到就抛 TimeoutException

思考

池类的设计，不管是连接池，线程池，一般不够用的场景，都是最大数设置的不够大，不足以应对并发获取连接、线程等？

2.2 确认HttpClient 实际情况

上面的都是猜想，最终问题定位还是要摸清实际情况。

应用里是如何创建 HttpClient 实例

下面是应用中创建 HttpClient 实例的地方，我们从变量 HTTP_CLIENT开始定位。

    private static final CloseableHttpClient HTTP_CLIENT = HttpClients
            .custom()
            .setDefaultRequestConfig(
                    RequestConfig.custom()
                    .setCookieSpec(CookieSpecs.IGNORE_COOKIES)
                    .setConnectionRequestTimeout(30000)
                    .build()
            ).build();

上面代码可知：

1.没有设置 maxPerRoute、maxTotal这类的参数。

2.从连接池获取连接的超时时间的参数 ConnectionRequestTimeout ，设置为30秒。

arthas 定位实际数据

因为不是本地环境，不能方便地 Debug，这时可以用 arthas 来查进行内存中 HttpClient 的实际数据。

在arthas中执行：getstatic com.skyme.HttpClientUtils HTTP_CLIENT -x 4，结果如下：

202207311205785.png

routeToPool中，route [https//github.com] 的使用情况是：leased为2，avaibable为0，pending为198。

leased + avaibable 结果为2，怎么这么少？难道是 maxPerRoute 参数设置的过小？

在arthas结果中我们找到了 maxPerRoute 相关的配置，maxPerRoute 默认值只有2，因为我们创建 HttpClient 实例时，没有设置maxPerRoute的数值，所以每个route 的 maxPerRoute 为2，这和上面的 leased + avaibable 的结果对应上了。此外，还有一个重要参数 maxToal 为20。

202207311359531.png

大致结论：默认参数太小

到这里，问题基本定位清楚了。因为应用没有显示地设置 maxPerRoute 和 maxToal ，所以取了默认值，默认值比较小，maxPerRoute 为2， maxToal 为20，导致并发获取连接时，连接不够用，导致了获取连接的线程一直在等待，等待时间为我们创建 HttpClien 实例时自定义的 30 秒，等待时间过长，导致了外部请求超时。

三、解决办法

1 临时办法

可以使用 arthas 的 vmtool 命令将内存中的 defaultMaxPerRoute 、maxTotal 参数修改大一些，该操作过于风骚，一般不符合生产环境的运维操作规范。但是测试环境或者紧急情况可以尝试使用。

[arthas@8433]$ vmtool --action getInstances --className org.apache.http.impl.client.CloseableHttpClient --express 'instances[0].connManager.pool.defaultMaxPerRoute=100'
@Integer[100]
[arthas@8433]$ vmtool --action getInstances --className org.apache.http.impl.client.CloseableHttpClient --express 'instances[0].connManager.pool.maxTotal=500'

修改后，连接池中 leased 明显增多，pending 减少。

202207311358937.png

一段时间后，pending为0，说明没有阻塞，连接数已足够使用。

202207311359531.png

注意：

1.该修改不需要应用重启，但应用重启后，内存的设置将销毁，依然还会存在问题。

2.非常规操作，谨慎执行。

3.vmtool 命令在 arthas 3.5.1 版本之后才有。

2 长期办法

修改代码，配置合理的最大连接数maxPerRoute、maxTotal、ConnectionRequestTimeout。

当前 HttpClient 版本为 4.5.2，其他版本可能代码有所差异。

    private final static PoolingHttpClientConnectionManager POOLING_HTTP_CLIENT_CONNECTION_MANAGER
            = new PoolingHttpClientConnectionManager();

    static {
        // 每个route默认的最大连接数
        POOLING_HTTP_CLIENT_CONNECTION_MANAGER.setDefaultMaxPerRoute(100);
        // 总的连接池最大连接数
        POOLING_HTTP_CLIENT_CONNECTION_MANAGER.setMaxTotal(500);
    }

    private static final CloseableHttpClient HTTP_CLIENT = HttpClients
        .custom()
        .setConnectionManager(POOLING_HTTP_CLIENT_CONNECTION_MANAGER)
        .setDefaultRequestConfig(
                RequestConfig.custom()
                .setCookieSpec(CookieSpecs.IGNORE_COOKIES)//
                // 从连接池获取连接的超时时间
                .setConnectionRequestTimeout(10000)
                .build()
        ).build();

defaultMaxPerRoute 设置为100，maxTotal 设置为500，并将 connectionRequestTimeout 从30秒减少到10秒，没有拿到连接快速失败，防止过多的线程阻塞导致挤压。

四、总结

1.问题原因是 HttpClient 使用时没有对参数足够了解，对于连接池的框架，最好设置最大数，获取连接的等待超时时间，以及回收的周期，连接是否可用的校验等等参数。其他池类的框架同理，例如线程池。

2.通过源码、日志、线程、连接等多方面进行分析，结论会更准确。

3.arthas 是神器，可以多尝试。

参考：

[Http持久连接与HttpClient连接池]https://juejin.cn/post/6844903986705727495