A Bloodbath Caused by Gateway Timeout Waits

The gateway is Zuul.

What happened:

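October 24 (1024), 2018. It should have been a peaceful, quiet Wednesday, a holiday meant for enjoying a massage~
But the nightmare began before the workday had even started.
I got to the office at 8:30 in the morning and was idly browsing Kr when the business chat group suddenly reported that requests from the app just kept spinning for users.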

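Half an hour later we had located the problem: the API gateway service was down.
The gateway service was down... the gateway service... was down...
A quick jstack check: 250 threads running, 100 threads suspended.
A look at the trace logs: more than half of the requests had taken over 10 minutes. 10 minutes??? WTF? Are you kidding?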

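In the end we traced it to calls to one particular service timing out like crazy, which left the other services waiting for thread resources.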

Analysis

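The first thing I could not understand was why so many calls ran for 10 minutes (measured in the call log from the start of the call to its end) and then still timed out, given that the gateway service has a 60 s timeout configured: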

zuul.host.socket-timeout-millis=60000
zuul.host.connect-timeout-millis=2000

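The nginx in front of the service also has a timeout, set to 75 s.
But the times recorded by the program cannot be lying, so something else must be eating that time. Where exactly?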

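After frantically tracing through the code, I found it.
The time was being spent in waiting threads. That's right, in threads that were just waiting.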

SimpleHostRoutingFilter->run()->forward()->forwardRequest()->CloseableHttpClient.execute()->InternalHttpClient.doExecute()
InternalHttpClient creates or obtains a RequestConfig. If the request does not supply one, HttpClient's defaultConfig is used, which is applied through setupContext,
as follows:

        try {
            final HttpRequestWrapper wrapper = HttpRequestWrapper.wrap(request, target);
            final HttpClientContext localcontext = HttpClientContext.adapt(
                    context != null ? context : new BasicHttpContext());
            RequestConfig config = null;
            if (request instanceof Configurable) {
                config = ((Configurable) request).getConfig();
            }
            // If config is null, derive one from the request params and the default config
            if (config == null) {
                final HttpParams params = request.getParams();
                if (params instanceof HttpParamsNames) {
                    if (!((HttpParamsNames) params).getNames().isEmpty()) {
                        config = HttpClientParamConfig.getRequestConfig(params, this.defaultConfig);
                    }
                } else {
                    config = HttpClientParamConfig.getRequestConfig(params, this.defaultConfig);
                }
            }
            if (config != null) {
                localcontext.setRequestConfig(config);
            }
            // Fill any attributes still missing from the context with the HttpClient defaults
            setupContext(localcontext);
            final HttpRoute route = determineRoute(target, wrapper, localcontext);
            return this.execChain.execute(route, wrapper, localcontext, execAware);
        } catch (final HttpException httpException) {
            throw new ClientProtocolException(httpException);
        }

    private void setupContext(final HttpClientContext context) {
        if (context.getAttribute(HttpClientContext.TARGET_AUTH_STATE) == null) {
            context.setAttribute(HttpClientContext.TARGET_AUTH_STATE, new AuthState());
        }
        if (context.getAttribute(HttpClientContext.PROXY_AUTH_STATE) == null) {
            context.setAttribute(HttpClientContext.PROXY_AUTH_STATE, new AuthState());
        }
        if (context.getAttribute(HttpClientContext.AUTHSCHEME_REGISTRY) == null) {
            context.setAttribute(HttpClientContext.AUTHSCHEME_REGISTRY, this.authSchemeRegistry);
        }
        if (context.getAttribute(HttpClientContext.COOKIESPEC_REGISTRY) == null) {
            context.setAttribute(HttpClientContext.COOKIESPEC_REGISTRY, this.cookieSpecRegistry);
        }
        if (context.getAttribute(HttpClientContext.COOKIE_STORE) == null) {
            context.setAttribute(HttpClientContext.COOKIE_STORE, this.cookieStore);
        }
        if (context.getAttribute(HttpClientContext.CREDS_PROVIDER) == null) {
            context.setAttribute(HttpClientContext.CREDS_PROVIDER, this.credentialsProvider);
        }
        // Key point: fall back to the default config
        if (context.getAttribute(HttpClientContext.REQUEST_CONFIG) == null) {
            context.setAttribute(HttpClientContext.REQUEST_CONFIG, this.defaultConfig);
        }
    }

Because the context passed in is null, a BasicHttpContext is created.
Next come
RedirectExec.execute
RetryExec.execute
ProtocolExec.execute
and finally MainClientExec.execute,
which obtains a connection from the connection pool:

        try {
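            // Read the connection-request timeout (the wait for a pooled connection)
            final int timeout = config.getConnectionRequestTimeout();
            // Lease the connection; a non-positive timeout is passed on as 0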
            managedConn = connRequest.get(timeout > 0 ? timeout : 0, TimeUnit.MILLISECONDS);
        } catch(final InterruptedException interrupted) {
            Thread.currentThread().interrupt();
            throw new RequestAbortedException("Request aborted", interrupted);
        } catch(final ExecutionException ex) {
            Throwable cause = ex.getCause();
            if (cause == null) {
                cause = ex;
            }
            throw new RequestAbortedException("Request execution failed", cause);
        }

Obtaining the connection:
PoolingHttpClientConnectionManager.leaseConnection()
AbstractConnPool.get

            @Override
            public E get(final long timeout, final TimeUnit tunit) throws InterruptedException, ExecutionException, TimeoutException {
                if (entry != null) {
                    return entry;
                }
                synchronized (this) {
                    try {
                        for (;;) {
                            // Block until a pooled connection can be leased
                            final E leasedEntry = getPoolEntryBlocking(route, state, timeout, tunit, this);
                            if (validateAfterInactivity > 0)  {
                                if (leasedEntry.getUpdated() + validateAfterInactivity <= System.currentTimeMillis()) {
                                    if (!validate(leasedEntry)) {
                                        leasedEntry.close();
                                        release(leasedEntry, false);
                                        continue;
                                    }
                                }
                            }
                            entry = leasedEntry;
                            done = true;
                            onLease(entry);
                            if (callback != null) {
                                callback.completed(entry);
                            }
                            return entry;
                        }
                    } catch (IOException ex) {
                        done = true;
                        if (callback != null) {
                            callback.failed(ex);
                        }
                        throw new ExecutionException(ex);
                    }
                }
            }

        };

AbstractConnPool.getPoolEntryBlocking
As the name suggests, this method blocks while obtaining a pool entry.
Heads up, this is the important part.

    private E getPoolEntryBlocking(
            final T route, final Object state,
            final long timeout, final TimeUnit tunit,
            final Future future) throws IOException, InterruptedException, TimeoutException {

        Date deadline = null;
        if (timeout > 0) {
            deadline = new Date (System.currentTimeMillis() + tunit.toMillis(timeout));
        }
        this.lock.lock();
        try {
            final RouteSpecificPool pool = getPool(route);
            E entry;
            for (;;) {
                Asserts.check(!this.isShutDown, "Connection pool shut down");
                for (;;) {
                    entry = pool.getFree(state);
                    if (entry == null) {
                        break;
                    }
                    if (entry.isExpired(System.currentTimeMillis())) {
                        entry.close();
                    }
                    if (entry.isClosed()) {
                        this.available.remove(entry);
                        pool.free(entry, false);
                    } else {
                        break;
                    }
                }
                if (entry != null) {
                    this.available.remove(entry);
                    this.leased.add(entry);
                    onReuse(entry);
                    return entry;
                }

                // New connection is needed
                final int maxPerRoute = getMax(route);
                // Shrink the pool prior to allocating a new connection
                final int excess = Math.max(0, pool.getAllocatedCount() + 1 - maxPerRoute);
                if (excess > 0) {
                    for (int i = 0; i < excess; i++) {
                        final E lastUsed = pool.getLastUsed();
                        if (lastUsed == null) {
                            break;
                        }
                        lastUsed.close();
                        this.available.remove(lastUsed);
                        pool.remove(lastUsed);
                    }
                }

                if (pool.getAllocatedCount() < maxPerRoute) {
                    final int totalUsed = this.leased.size();
                    final int freeCapacity = Math.max(this.maxTotal - totalUsed, 0);
                    if (freeCapacity > 0) {
                        final int totalAvailable = this.available.size();
                        if (totalAvailable > freeCapacity - 1) {
                            if (!this.available.isEmpty()) {
                                final E lastUsed = this.available.removeLast();
                                lastUsed.close();
                                final RouteSpecificPool otherpool = getPool(lastUsed.getRoute());
                                otherpool.remove(lastUsed);
                            }
                        }
                        final C conn = this.connFactory.create(route);
                        entry = pool.add(conn);
                        this.leased.add(entry);
                        return entry;
                    }
                }

                boolean success = false;
                try {
                    if (future.isCancelled()) {
                        throw new InterruptedException("Operation interrupted");
                    }
                    pool.queue(future);
                    this.pending.add(future);
                    if (deadline != null) {
                        success = this.condition.awaitUntil(deadline);
                    } else {
                        this.condition.await();
                        success = true;
                    }
                    if (future.isCancelled()) {
                        throw new InterruptedException("Operation interrupted");
                    }
                } finally {
                    // In case of 'success', we were woken up by the
                    // connection pool and should now have a connection
                    // waiting for us, or else we're shutting down.
                    // Just continue in the loop, both cases are checked.
                    pool.unqueue(future);
                    this.pending.remove(future);
                }
                // check for spurious wakeup vs. timeout
                if (!success && (deadline != null && deadline.getTime() <= System.currentTimeMillis())) {
                    break;
                }
            }
            throw new TimeoutException("Timeout waiting for connection");
        } finally {
            this.lock.unlock();
        }
    }

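This code is rather long, so let's go through the pool-leasing logic piece by piece:
1. The code first declares a deadline and then checks timeout. Pay attention to this timeout: deadline is assigned only if it is greater than zero. If it is 0, deadline is never assigned, which means deadline stays null.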

        Date deadline = null;
        if (timeout > 0) {
            // A positive timeout sets the deadline
            deadline = new Date (System.currentTimeMillis() + tunit.toMillis(timeout));
        }
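
The effect of this check can be sketched in isolation. The following is only a minimal sketch, not HttpClient code: the class `LeaseDeadline` and its method names are hypothetical. It assumes the caller passes in the raw `connectionRequestTimeout`, which, as far as I know, is -1 in HttpClient 4.x when nothing has been configured.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not part of HttpClient.
final class LeaseDeadline {

    // Mirrors the MainClientExec call: any non-positive timeout is passed on as 0.
    static long normalize(int connectionRequestTimeout) {
        return connectionRequestTimeout > 0 ? connectionRequestTimeout : 0;
    }

    // Mirrors getPoolEntryBlocking: a deadline exists only for a positive timeout.
    // A null result means "wait without any time limit".
    static Date deadlineFor(long timeout, TimeUnit unit, long nowMillis) {
        if (timeout > 0) {
            return new Date(nowMillis + unit.toMillis(timeout));
        }
        return null;
    }
}
```

With an unconfigured timeout, `normalize` yields 0 and `deadlineFor` yields null, which is exactly the case that matters for the rest of this analysis.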

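2. Enter the locked section. pool.getFree fetches a pooled entry. If one is obtained and the check shows the connection is neither expired nor closed, entry is returned directly.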

                Asserts.check(!this.isShutDown, "Connection pool shut down");
                for (;;) {
                    // Fetch a free entry from the pool
                    entry = pool.getFree(state);
                    if (entry == null) {
                        break;
                    }
                    // Close the entry if it has expired
                    if (entry.isExpired(System.currentTimeMillis())) {
                        entry.close();
                    }
                    if (entry.isClosed()) {
                        this.available.remove(entry);
                        pool.free(entry, false);
                    } else {
                        break;
                    }
                }
                if (entry != null) {
                    this.available.remove(entry);
                    this.leased.add(entry);
                    onReuse(entry);
                    return entry;
                }

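3. If no entry was obtained, execution continues with the code below.
4. Check whether the route has reached its configured maximum pool size and whether a new connection needs to be added. If so, the pool is shrunk before the new connection is added, and then the new entry is allocated and returned.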

                // New connection is needed: check whether one may be created
                final int maxPerRoute = getMax(route);
                // Shrink the pool prior to allocating a new connection
                final int excess = Math.max(0, pool.getAllocatedCount() + 1 - maxPerRoute);
                if (excess > 0) {
                    for (int i = 0; i < excess; i++) {
                        final E lastUsed = pool.getLastUsed();
                        if (lastUsed == null) {
                            break;
                        }
                        lastUsed.close();
                        this.available.remove(lastUsed);
                        pool.remove(lastUsed);
                    }
                }

                if (pool.getAllocatedCount() < maxPerRoute) {
                    final int totalUsed = this.leased.size();
                    final int freeCapacity = Math.max(this.maxTotal - totalUsed, 0);
                    if (freeCapacity > 0) {
                        final int totalAvailable = this.available.size();
                        if (totalAvailable > freeCapacity - 1) {
                            if (!this.available.isEmpty()) {
                                final E lastUsed = this.available.removeLast();
                                lastUsed.close();
                                final RouteSpecificPool otherpool = getPool(lastUsed.getRoute());
                                otherpool.remove(lastUsed);
                            }
                        }
                        final C conn = this.connFactory.create(route);
                        entry = pool.add(conn);
                        this.leased.add(entry);
                        return entry;
                    }
                }
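
The decision made by this block boils down to two limits. Here is a minimal sketch of that decision alone, under the assumption that the shrink step has already run. `PoolCapacity` and `canCreate` are hypothetical names, not HttpClient API.

```java
// Hypothetical illustration of the capacity checks; not HttpClient code.
final class PoolCapacity {

    // A new connection may be created only if the route is below its own limit
    // and the pool as a whole still has leasable capacity.
    static boolean canCreate(int allocatedForRoute, int maxPerRoute,
                             int leasedTotal, int maxTotal) {
        if (allocatedForRoute >= maxPerRoute) {
            return false; // per-route limit reached
        }
        int freeCapacity = Math.max(maxTotal - leasedTotal, 0);
        return freeCapacity > 0; // false once the global limit is reached
    }
}
```

When either limit is hit, the method returns false, and in the real code the caller falls through to the waiting branch.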

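5. If none of the above applies, the pool is used up and already at its maximum size, so no connection can be taken from it. The only option is to wait...
6. When waiting, deadline is checked. If deadline is not null, the thread awaits until that time. If it is null, the wait is unbounded and lasts until a connection becomes available.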

                boolean success = false;
                try {
                    if (future.isCancelled()) {
                        throw new InterruptedException("Operation interrupted");
                    }
                    pool.queue(future);
                    this.pending.add(future);
                    // Check whether a deadline was set
                    if (deadline != null) {
                        // If so, wait until the deadline at most
                        success = this.condition.awaitUntil(deadline);
                    } else {
                        // If not, wait indefinitely: there is no timeout
                        this.condition.await();
                        success = true;
                    }
                    if (future.isCancelled()) {
                        throw new InterruptedException("Operation interrupted");
                    }
                } finally {
                    // In case of 'success', we were woken up by the
                    // connection pool and should now have a connection
                    // waiting for us, or else we're shutting down.
                    // Just continue in the loop, both cases are checked.
                    pool.unqueue(future);
                    this.pending.remove(future);
                }
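
The difference between the two branches can be reproduced with nothing but java.util.concurrent. The sketch below is a demo under simplifying assumptions: `BoundedWaitDemo` is a hypothetical class, and nobody ever signals the condition, which stands in for a pool whose connections are never released.

```java
import java.util.Date;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; it only uses java.util.concurrent.
final class BoundedWaitDemo {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition resourceFreed = lock.newCondition();

    // Bounded wait: returns false once the deadline passes without a signal.
    boolean awaitWithDeadline(long timeoutMillis) {
        Date deadline = new Date(System.currentTimeMillis() + timeoutMillis);
        lock.lock();
        try {
            for (;;) {
                if (!resourceFreed.awaitUntil(deadline)) {
                    return false; // deadline elapsed, the caller can fail fast
                }
                // Woken before the deadline; nobody signals here, so it was spurious.
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    // Unbounded wait: only a signal or an interrupt can end it.
    void awaitForever() throws InterruptedException {
        lock.lock();
        try {
            for (;;) {
                resourceFreed.await();
            }
        } finally {
            lock.unlock();
        }
    }

    // Returns true if a thread in awaitForever() is still blocked after graceMillis.
    static boolean stillBlockedAfter(long graceMillis) {
        BoundedWaitDemo demo = new BoundedWaitDemo();
        Thread waiter = new Thread(() -> {
            try {
                demo.awaitForever();
            } catch (InterruptedException e) {
                // Interrupted by the demo so that the thread can exit.
            }
        });
        waiter.setDaemon(true);
        waiter.start();
        try {
            waiter.join(graceMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean blocked = waiter.isAlive();
        waiter.interrupt();
        return blocked;
    }
}
```

The bounded variant hands control back to the caller after the timeout, while the unbounded variant holds its thread until something outside intervenes. That matches the threads seen hanging in the jstack output.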

Summary

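At this point the picture is clear.
1. One backend service was effectively down because of its calls to a third party, so every request the GW sent to it timed out.
2. Because this service receives a very large number of requests, the connection pool the GW allots to it was exhausted, no connection could be leased, and the request threads kept piling up in the GW.
3. The number of GW threads tied up by this service kept growing, so other services could no longer work properly either.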
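
The failure mode can be modelled in a few lines. This is a rough model and not how HttpClient implements its pool: a `Semaphore` stands in for the per-route connection pool, and `RoutePoolModel` with all of its methods is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model: a Semaphore stands in for the per-route connection pool.
final class RoutePoolModel {
    private final Semaphore connections;

    RoutePoolModel(int maxPerRoute) {
        this.connections = new Semaphore(maxPerRoute);
    }

    // Tries to lease a connection, giving up after leaseTimeoutMillis.
    // Returns true if a connection was leased.
    boolean lease(long leaseTimeoutMillis) {
        try {
            return connections.tryAcquire(leaseTimeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns a connection to the pool.
    void release() {
        connections.release();
    }

    // Exhausts the pool (as the hung backend did), then measures how long one
    // more caller is held up before being rejected. Returns -1 if it got a lease.
    static long rejectedAfterMillis(int maxPerRoute, long leaseTimeoutMillis) {
        RoutePoolModel pool = new RoutePoolModel(maxPerRoute);
        for (int i = 0; i < maxPerRoute; i++) {
            pool.lease(leaseTimeoutMillis); // these connections are never returned
        }
        long start = System.nanoTime();
        boolean leased = pool.lease(leaseTimeoutMillis);
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return leased ? -1 : waited;
    }
}
```

With a lease timeout in place, the extra caller is turned away after roughly that timeout and its thread is freed. Without one, it would sit in the queue for as long as the backend stays hung.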



Fix

It is actually simple: add a timeout.
This timeout is how long to wait for a connection from the pool.
In Zuul, override SimpleHostRoutingFilter and its HTTPClient creation, and set ConnectionRequestTimeout in the RequestConfig.

    protected CloseableHttpClient newClient() {
        if(connectionRequestTimeout ==  null || connectionRequestTimeout <= 0){
            connectionRequestTimeout = 60;
        }
        final RequestConfig requestConfig = RequestConfig.custom()
                // Socket (read) timeout
                .setSocketTimeout(SOCKET_TIMEOUT.get())
                // Connect timeout
                .setConnectTimeout(CONNECTION_TIMEOUT.get())
                // Timeout for leasing a connection from the pool (milliseconds)
                .setConnectionRequestTimeout(connectionRequestTimeout)
                .setCookieSpec(CookieSpecs.IGNORE_COOKIES).build();

        HttpClientBuilder httpClientBuilder = HttpClients.custom();
        if (!this.sslHostnameValidationEnabled) {
            httpClientBuilder.setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE);
        }
        return httpClientBuilder.setConnectionManager(newConnectionManager())
                .disableContentCompression()
                .useSystemProperties().setDefaultRequestConfig(requestConfig)
                .setRetryHandler(new DefaultHttpRequestRetryHandler(0, false))
                .setRedirectStrategy(new RedirectStrategy() {
                    @Override
                    public boolean isRedirected(HttpRequest request,
                                                HttpResponse response, HttpContext context)
                            throws ProtocolException {
                        return false;
                    }

                    @Override
                    public HttpUriRequest getRedirect(HttpRequest request,
                                                      HttpResponse response, HttpContext context)
                            throws ProtocolException {
                        return null;
                    }
                }).build();
    }
