Spring Cloud Gateway: A Diary of Pitfalls

Table of Contents

1. Background

2. Pitfalls

Pitfall 1: Gateway processing time recorded via SCG's GlobalFilter is inaccurate

Pitfall 2: reactor-netty's epoll & kqueue mode

Pitfall 3: SCG updates route information synchronously

Pitfall 4: Ribbon's lazy loading

Pitfall 5: Direct (off-heap) memory leak

Pitfall 6: The 200 QPS ceiling we couldn't break

3. Summary


1. Background

Our team has fully embraced the Spring Cloud ecosystem, but for historical reasons (an old version of Tencent Cloud TSF, plus an in-house utility package maintained by our own developers) every project is stuck on Spring Cloud 2.1.2.RELEASE, so our Spring Cloud Gateway (SCG from here on) is also 2.1.2.RELEASE. SCG is a dedicated gateway built on Spring WebFlux; Spring WebFlux, like Spring MVC, is built on Spring Web, and it exists as a separate product because "reactorizing" Spring MVC itself would have been costly and hard to maintain. Spring Web already supported reactive streams back in 2017, as its Gradle file shows:

dependencyManagement {
	imports {
		mavenBom "io.projectreactor:reactor-bom:${reactorVersion}"
		mavenBom "io.netty:netty-bom:${nettyVersion}"
		mavenBom "org.eclipse.jetty:jetty-bom:${jettyVersion}"
	}
}

Note the reactor-bom entry. Honestly, when we did the gateway selection we dropped Zuul in favor of SCG, not expecting to be pinned to SCG 2.1.2.RELEASE. That version was released in June 2019, and because SCG was still a young member of the Spring Cloud family, everything about it was iterating and upgrading quickly; although less than two years have passed, its modules have already been reorganized substantially (in the newest versions spring-cloud-gateway-core no longer exists and has become spring-cloud-gateway-server). This post describes several of the pitfalls I ran into; they have since been fixed in newer versions.

2. Pitfalls

Before diving in, here is where our self-built SCG-based gateway sits in the overall architecture (it pays to be clear about its role):

[Figure 1: position of the SCG-based gateway in the overall architecture]

Pitfall 1: Gateway processing time recorded via SCG's GlobalFilter is inaccurate

Recording SCG's request-processing time is not as straightforward as it sounds. Our first approach used a GlobalFilter: we wrote a LogGlobalFilter and gave it a fairly small Order (for example below 0) so that it runs early in the chain, recorded one timestamp when the filter method is entered and another when the response comes back. Here is the code snippet:

public class LogGlobalFilter extends AbstractGlobalFilter {

    private ModifyResponseBodyGatewayFilterFactory factory = new ModifyResponseBodyGatewayFilterFactory();
    private GatewayFilter modifyResponseBodyGatewayFilter;

    @PostConstruct
    public void init() {
        ModifyResponseBodyGatewayFilterFactory.Config config = new ModifyResponseBodyGatewayFilterFactory.Config();
        config.setInClass(String.class);
        config.setOutClass(String.class);
        config.setRewriteFunction(new GatewayResponseRewriteFunction());

        modifyResponseBodyGatewayFilter = factory.apply(config);
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {

        exchange.getAttributes().put(REQUEST_START_NANO_TIME_ATTRIBUTE, System.nanoTime());

        return modifyResponseBodyGatewayFilter.filter(exchange, chain).doOnSuccess(
            (Void v) -> {
                Long reqStartNanoTime = exchange.getAttribute(REQUEST_START_NANO_TIME_ATTRIBUTE);
                StringBuilder logStr = new StringBuilder("调用成功")
                        .append(", Gateway应答:").append((String) exchange.getAttribute(RESPONSE_BODY_ATTRIBUTE))
                        .append(", 耗时(毫秒):").append(reqStartNanoTime == null ?
                                "计算失败" : (System.nanoTime() - reqStartNanoTime.longValue()) / 1000000);
                log.info(logStr.toString());
            }
        );
    }

    private static class GatewayResponseRewriteFunction implements RewriteFunction<String, String> {
        @Override
        public Publisher<String> apply(ServerWebExchange exchange, String body) {
            exchange.getAttributes().put(RESPONSE_BODY_ATTRIBUTE, body);
            return Mono.just(body);
        }
    }
}

On the surface this looks fine: LogGlobalFilter's Order is below 0, so it runs with a fairly high priority; we record the local time at the start of filter and again in doOnSuccess, and the difference gives a rough processing time.
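For completeness, the AbstractGlobalFilter that LogGlobalFilter extends is one of our in-house base classes and is not shown above; a minimal sketch of what such a base class might look like (purely an assumption, the real class may differ) is:

import org.springframework.cloud.gateway.filter.GlobalFilter;
import org.springframework.core.Ordered;

// Hypothetical sketch only: the real AbstractGlobalFilter is an internal class.
public abstract class AbstractGlobalFilter implements GlobalFilter, Ordered {

    @Override
    public int getOrder() {
        // A negative order places the filter near the front of the global filter chain.
        return -100;
    }
}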

The client-side servers use a timeout of a few seconds, and while QA was testing the app they occasionally hit timeouts. Tracing by traceId, we found that the entry time recorded by the gateway was one to two seconds later than the time the client server issued the request. At first we suspected an unstable public network, but that did not hold up, and there was another odd detail: SkyWalking showed the request reaching the gateway one to two seconds earlier than LogGlobalFilter did. In other words, SkyWalking's arrival time was the one that matched expectations (within a few tens of milliseconds of the client's send time).

Strange. How does SkyWalking manage to record the request entry time more accurately (earlier) than LogGlobalFilter? With that question in mind, I went and read SkyWalking's code:

[Figure 2: SkyWalking's bytecode instrumentation of Spring WebFlux's DispatcherHandler]

SkyWalking applies bytecode enhancement (think of it as an AOP-like effect) to DispatcherHandler, the core dispatch handler of Spring WebFlux, and records the time there. So we changed our strategy and recorded the request entry time through an AOP advice on DispatcherHandler. Here is the code snippet:

@Component
public class DispatcherHandlerMethodInterceptor implements MethodInterceptor {

    @Override
    public Object invoke(MethodInvocation methodInvocation) throws Throwable {

        if ("handle".equals(methodInvocation.getMethod().getName()) &&
                methodInvocation.getArguments().length == 1 &&
                methodInvocation.getArguments()[0] instanceof ServerWebExchange) {

            ServerWebExchange exchange = (ServerWebExchange) methodInvocation.getArguments()[0];
            // Record the request start time
            exchange.getAttributes().put(REQUEST_START_NANO_TIME_ATTRIBUTE, System.nanoTime());
          

            log.info("Gateway receive request, path:{}, header:{}, params:{}",
                    exchange.getRequest().getPath(), exchange.getRequest().getHeaders(),
                    exchange.getRequest().getQueryParams());

        }

        return methodInvocation.proceed();
    }
}


@Import({DispatcherHandlerMethodInterceptor.class})
@Configuration
public class ConfigurableAdvisorConfig {

    private static final String DISPATCHER_HANDLER_POINTCUT =
            "execution(public * org.springframework.web.reactive.DispatcherHandler.handle(..))";

    @Autowired
    private DispatcherHandlerMethodInterceptor dispatcherHandlerMethodInterceptor;


    @Bean
    public AspectJExpressionPointcutAdvisor buildDispatcherHandlerPointcutAdvisor() {
        AspectJExpressionPointcutAdvisor advisor = new AspectJExpressionPointcutAdvisor();
        advisor.setExpression(DISPATCHER_HANDLER_POINTCUT);
        advisor.setAdvice(dispatcherHandlerMethodInterceptor);
        return advisor;
    }
}

Pitfall 2: reactor-netty's epoll & kqueue mode

Most programmers know Netty as an excellent I/O library, but may never have heard of reactor-netty (https://github.com/reactor/reactor-netty). In short, Reactor Netty offers non-blocking and backpressure-ready TCP/HTTP/UDP clients and servers based on the Netty framework. SCG depends heavily on reactor-netty (more precisely, Spring WebFlux does). SCG 2.1.2.RELEASE pulls in reactor-netty 0.8.9.RELEASE, which in turn depends on reactor-core (https://github.com/reactor/reactor-core) 3.2.10.RELEASE; both are fairly old. The release notes show that some reactor-netty performance bugs were fixed in later versions, but we could not jump to anything too new either, since Spring Web itself depends on this library underneath. In the end we upgraded reactor-netty from 0.8.9.RELEASE to 0.8.23.RELEASE and reactor-core accordingly from 3.2.10.RELEASE to 3.2.22.RELEASE. That still did not get rid of the client-side timeouts, so we started wondering whether an I/O thread was blocking somewhere. Each timeout lasts only a few seconds, which makes it hard to run jstack right at the moment it happens, so we first ran jstack just to see what threads existed at all, given that the gateway only gets 6 cores and 8 GB. To our surprise, the dump contained 32 threads named "reactor-http-epoll-x":

"reactor-http-epoll-32" #204 daemon prio=5 os_prio=0 tid=0x00007f020c1e6320 nid=0xdd runnable [0x00007f0200bf7000]
   java.lang.Thread.State: RUNNABLE
	at io.netty.channel.epoll.Native.epollWait0(Native Method)
	at io.netty.channel.epoll.Native.epollWait(Native.java:114)
	at io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:256)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:281)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- None

"reactor-http-epoll-31" #203 daemon prio=5 os_prio=0 tid=0x00007f020c068df0 nid=0xdc runnable [0x00007f0200cf8000]
   java.lang.Thread.State: RUNNABLE
	at io.netty.channel.epoll.Native.epollWait0(Native Method)
	at io.netty.channel.epoll.Native.epollWait(Native.java:114)
	at io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:256)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:281)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- None

"reactor-http-epoll-30" #202 daemon prio=5 os_prio=0 tid=0x00007f020c0678e0 nid=0xdb runnable [0x00007f0200df9000]
   java.lang.Thread.State: RUNNABLE
	at io.netty.channel.epoll.Native.epollWait0(Native Method)
	at io.netty.channel.epoll.Native.epollWait(Native.java:114)
	at io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:256)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:281)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- None

That reminded me of a problem JDK 8 had not yet fixed: when determining the CPU count, the Java runtime ignored the resources available to the container and returned the host machine's core count, which happened to be 32. Could having that many epoll threads be the reason CPU usage climbed and requests timed out as traffic increased? To verify this we had to bring the epoll thread count down to 6, matching the CPUs this Docker instance can actually use. How? A pile of search results later we still had nothing, so back to the source. Here is the relevant reactor-netty code:

@FunctionalInterface
public interface LoopResources extends Disposable {

	/**
	 * Default worker thread count, fallback to available processor
	 * (but with a minimum value of 4)
	 */
	int DEFAULT_IO_WORKER_COUNT = Integer.parseInt(System.getProperty(
			ReactorNetty.IO_WORKER_COUNT,
			"" + Math.max(Runtime.getRuntime()
			            .availableProcessors(), 4)));
	/**
	 * Default selector thread count, fallback to -1 (no selector thread)
	 */
	int DEFAULT_IO_SELECT_COUNT = Integer.parseInt(System.getProperty(
			ReactorNetty.IO_SELECT_COUNT,
			"" + -1));

    // other code omitted
}


public final class ReactorNetty {

	// System properties names


	/**
	 * Specifies whether the channel ID will be prepended to the log message when possible.
	 * By default it will be prepended.
	 */
	static final boolean LOG_CHANNEL_INFO =
			Boolean.parseBoolean(System.getProperty("reactor.netty.logChannelInfo", "true"));

	/**
	 * Default worker thread count, fallback to available processor
	 * (but with a minimum value of 4)
	 */
	public static final String IO_WORKER_COUNT = "reactor.netty.ioWorkerCount";
	/**
	 * Default selector thread count, fallback to -1 (no selector thread)
	 */
	public static final String IO_SELECT_COUNT = "reactor.netty.ioSelectCount";

    // other code omitted
}

The answer is obvious: these defaults can be overridden with system properties. For simplicity we set them directly in the gateway's main method:

@SpringBootApplication
public class GatewayServerApplication {

    public static void main(String[] args) {
        System.setProperty(ReactorNetty.IO_WORKER_COUNT, "6");
        System.setProperty(ReactorNetty.IO_SELECT_COUNT, "6");

        SpringApplication.run(GatewayServerApplication.class, args);
    }
}

After the change was deployed, a quick local JMeter test showed that under the same load 6 epoll threads were indeed much steadier than 32, and the number of timed-out requests dropped sharply. Then I ran jstack again and found that the epoll threads were gone; instead there were 6 "reactor-http-nio-x" threads:

"reactor-http-nio-5" #241 daemon prio=5 os_prio=0 tid=0x00007f348001a2a0 nid=0x105 runnable [0x00007f33f03f1000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x000000069b9e6eb0> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x000000069b9e6f28> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000069ba9e2f0> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:791)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:439)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- None

"reactor-http-nio-4" #240 daemon prio=5 os_prio=0 tid=0x00007f3480024850 nid=0x104 runnable [0x00007f33f04f2000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x000000069b9e7330> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x000000069b9e7078> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000069ba9e380> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:791)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:439)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- None

Odd, but there were other things to do. A few iterations and deployments later, QA reported client-server timeouts again. Having ruled out the network, we went back to the gateway and ran jstack once more, and the threads had somehow turned back into 6 epoll threads. Very strange. We wanted to force them back to NIO (for some reason I was quite convinced that this version's NIO implementation behaved better for us than epoll), so back into the reactor-netty source:

final class HttpServerBind extends HttpServer
		implements Function<ServerBootstrap, ServerBootstrap> {

	@Override
	public ServerBootstrap apply(ServerBootstrap b) {
		HttpServerConfiguration conf = HttpServerConfiguration.getAndClean(b);


		if (b.config()
		     .group() == null) {
			LoopResources loops = HttpResources.get();

			// Note: the EventLoopGroup is selected based on LoopResources.DEFAULT_NATIVE
			EventLoopGroup selector = loops.onServerSelect(LoopResources.DEFAULT_NATIVE);
			EventLoopGroup elg = loops.onServer(LoopResources.DEFAULT_NATIVE);

			b.group(selector, elg)
			 .channel(loops.onServerChannel(elg));
		}
    // some code omitted
}


final class DefaultLoopResources extends AtomicLong implements LoopResources {

	@Override
	public EventLoopGroup onServerSelect(boolean useNative) {
        // If useNative is true and preferNative() detects local epoll or kqueue support, use the native transport
		if (useNative && preferNative()) {
			return cacheNativeSelectLoops();
		}

        // Otherwise fall back to the classic NIO mode
		return cacheNioSelectLoops();
	}

	
	EventLoopGroup cacheNioSelectLoops() {
		if (serverSelectLoops == serverLoops) {
			return cacheNioServerLoops();
		}

		EventLoopGroup eventLoopGroup = serverSelectLoops.get();
		if (null == eventLoopGroup) {
			eventLoopGroup = cacheNioSelectLoops();
		}
		return eventLoopGroup;
	}
}


@FunctionalInterface
public interface LoopResources extends Disposable {

	/**
	 * Default value whether the native transport (epoll, kqueue) will be preferred,
	 * fallback it will be preferred when available
	 */
	boolean DEFAULT_NATIVE = Boolean.parseBoolean(System.getProperty(
			ReactorNetty.NATIVE,
			"true"));

	/**
	 * Return true if should default to native {@link EventLoopGroup} and {@link Channel}
	 *
	 * @return true if should default to native {@link EventLoopGroup} and {@link Channel}
	 */
	default boolean preferNative() {
		return DefaultLoopEpoll.hasEpoll() || DefaultLoopKQueue.hasKQueue();
	}
}

Sure enough, it is the same system-property trick, so we added one more line to the gateway's main method:

@SpringBootApplication
public class GatewayServerApplication {

    public static void main(String[] args) {

        System.setProperty(ReactorNetty.NATIVE, "false");
        System.setProperty(ReactorNetty.IO_WORKER_COUNT, "6");
        System.setProperty(ReactorNetty.IO_SELECT_COUNT, "6");


        SpringApplication.run(GatewayServerApplication.class, args);
    }
}

As for why it flipped between NIO and epoll before we added this property, I suspect it has to do with the native-transport detection LoopResources performs, and I did not dig further. Also, the claim that reactor-netty's NIO mode beats epoll here is not necessarily transferable; always judge by your own load tests (and mind the reactor-netty version).
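If you want to check up front whether the native transport can even load in a given image, a small probe like the one below saves a round of jstack guessing. It is just plain Netty API, and it assumes netty-transport-native-epoll is on the classpath (it clearly is for us, since epoll threads showed up):

import io.netty.channel.epoll.Epoll;

// Prints whether Netty's native epoll transport is usable in this environment,
// which is what reactor-netty's preferNative() check ultimately depends on.
public class TransportProbe {

    public static void main(String[] args) {
        System.out.println("reactor.netty.native = " + System.getProperty("reactor.netty.native"));
        System.out.println("Epoll available: " + Epoll.isAvailable());
        if (!Epoll.isAvailable()) {
            // The cause explains why the runtime silently fell back to NIO.
            System.out.println("Fallback reason: " + Epoll.unavailabilityCause());
        }
    }
}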

Pitfall 3: SCG updates route information synchronously

Not long after, QA came back with yet more timeouts. Looking at the symptoms, this no longer seemed to be the thread-count problem. Combined with the gateway's request logs, we saw that during the one or two seconds when the problem occurred, none of the 6 NIO threads did any work (more precisely, none of them logged a request or a response). Given the traffic at the time and the latency of the downstream services, each of those 6 NIO threads should have handled several requests per second; two full seconds without a single request or response log should simply not happen. Ten years of Java told me "GC", but the gateway's GC logs showed no full GC at that moment (I did not quite believe it and double-checked on Tencent Cloud's TSF monitoring; indeed none). The only way left to localize the problem was a jstack at the exact moment it happened, but it was sporadic, not reproducible on demand, and lasted only a second or two, so how do you catch it? Arthas? No, I tried and found nothing that fit. So I turned to a slightly unorthodox trick: can we print thread stacks directly from Java code? It turns out we can, so I wrote a Spring bean that prints the stack of one NIO thread every 100 ms (the problem lasts one to two seconds, so at 100 ms we get several samples, which helps pin it down). The code:

@Component
public class TempTheadStackPrintService implements Runnable {

    private Logger threadLogger = LoggerFactory.getLogger("asyncThreadLogger");

    private final ScheduledExecutorService printThreadPool = new ScheduledThreadPoolExecutor(1,
            r -> new Thread(r, "SropTempTheadStackPrintService"));

    @PostConstruct
    public void init() {
        printThreadPool.scheduleWithFixedDelay(this, 120000, 100, TimeUnit.MILLISECONDS);
    }

    @Override
    public void run() {

        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread, StackTraceElement[]> traceMap = Thread.getAllStackTraces();
        Set<Thread> allThreads = traceMap.keySet();
        StringBuilder msg = new StringBuilder();
        for (Thread thread : allThreads) {
            long tid = thread.getId();
            ThreadInfo threadInfo = bean.getThreadInfo(tid);
            // Note: only print the stack of the thread named reactor-http-nio-3
            if (threadInfo == null || !threadInfo.getThreadName().startsWith("reactor-http-nio-3")) {
                continue;
            }

            String lockInfo = threadInfo.getLockName() == null ? " " : ", " + threadInfo.getLockName();
            msg.append("thread id: " + tid + ", name: " + threadInfo.getThreadName() +
                    ", state: " + threadInfo.getThreadState() + ", lock: " + lockInfo).append("\n");
            StackTraceElement[] stackTraces = thread.getStackTrace();
            for (StackTraceElement stackTrace : stackTraces) {
                msg.append("\t").append(stackTrace).append("\n");
            }

            msg.append("\n");
        }

        threadLogger.info(msg.toString());
    }
}

Armed with this little tool, I just waited for QA to ping me again. A few minutes later another case came in, same as before: for about two seconds the 6 NIO threads went off duty, clearly a gateway-internal issue. So I went straight to the stacks of reactor-http-nio-3 to see whether it was waiting on a lock or doing something else:

[2021-03-29 15:50:07.022][INFO] - thread id: 163, name: reactor-http-nio-3, state: RUNNABLE, lock:  
	java.security.AccessController.doPrivileged(Native Method)
	org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:808)
	org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:419)
	org.apache.commons.logging.LogFactory.getLog(LogFactory.java:655)
	org.springframework.core.env.PropertySource.<init>(PropertySource.java:62)
	org.springframework.core.env.EnumerablePropertySource.<init>(EnumerablePropertySource.java:48)
	org.springframework.core.env.MapPropertySource.<init>(MapPropertySource.java:35)
	org.springframework.boot.context.properties.source.MapConfigurationPropertySource.<init>(MapConfigurationPropertySource.java:55)
	org.springframework.cloud.gateway.support.ConfigurationUtils.bind(ConfigurationUtils.java:45)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.lookup(RouteDefinitionRouteLocator.java:243)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.combinePredicates(RouteDefinitionRouteLocator.java:217)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.convertToRoute(RouteDefinitionRouteLocator.java:142)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator$$Lambda$395/564059141.apply(Unknown Source)


[2021-03-29 15:50:07.137][INFO] - thread id: 163, name: reactor-http-nio-3, state: RUNNABLE, lock:  
	java.lang.reflect.Array.get(Native Method)
	org.springframework.boot.context.properties.bind.Bindable.box(Bindable.java:254)
	org.springframework.boot.context.properties.bind.Bindable.of(Bindable.java:246)
	org.springframework.boot.context.properties.bind.JavaBeanBinder.bind(JavaBeanBinder.java:81)
	org.springframework.boot.context.properties.bind.JavaBeanBinder.bind(JavaBeanBinder.java:70)
	org.springframework.boot.context.properties.bind.JavaBeanBinder.bind(JavaBeanBinder.java:54)
	org.springframework.boot.context.properties.bind.Binder.lambda$null$4(Binder.java:329)
	org.springframework.boot.context.properties.bind.Binder$$Lambda$94/1847559273.apply(Unknown Source)
	java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1359)
	java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
	java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
	java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
	java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
	java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
	org.springframework.boot.context.properties.bind.Binder.lambda$bindBean$5(Binder.java:330)
	org.springframework.boot.context.properties.bind.Binder$$Lambda$93/425107133.get(Unknown Source)
	org.springframework.boot.context.properties.bind.Binder$Context.withIncreasedDepth(Binder.java:429)
	org.springframework.boot.context.properties.bind.Binder$Context.withBean(Binder.java:415)
	org.springframework.boot.context.properties.bind.Binder$Context.access$400(Binder.java:372)
	org.springframework.boot.context.properties.bind.Binder.bindBean(Binder.java:328)
	org.springframework.boot.context.properties.bind.Binder.bindObject(Binder.java:269)
	org.springframework.boot.context.properties.bind.Binder.bind(Binder.java:214)
	org.springframework.boot.context.properties.bind.Binder.bind(Binder.java:202)
	org.springframework.boot.context.properties.bind.Binder.bind(Binder.java:159)
	org.springframework.cloud.gateway.support.ConfigurationUtils.bind(ConfigurationUtils.java:47)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.lookup(RouteDefinitionRouteLocator.java:243)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.combinePredicates(RouteDefinitionRouteLocator.java:212)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.convertToRoute(RouteDefinitionRouteLocator.java:142)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator$$Lambda$395/564059141.apply(Unknown Source)


[2021-03-29 15:50:07.249][INFO] - thread id: 163, name: reactor-http-nio-3, state: RUNNABLE, lock:  
	java.security.AccessController.doPrivileged(Native Method)
	org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:808)
	org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:419)
	org.apache.commons.logging.LogFactory.getLog(LogFactory.java:655)
	org.springframework.core.env.PropertySource.<init>(PropertySource.java:62)
	org.springframework.core.env.EnumerablePropertySource.<init>(EnumerablePropertySource.java:48)
	org.springframework.core.env.MapPropertySource.<init>(MapPropertySource.java:35)
	org.springframework.boot.context.properties.source.MapConfigurationPropertySource.<init>(MapConfigurationPropertySource.java:55)
	org.springframework.cloud.gateway.support.ConfigurationUtils.bind(ConfigurationUtils.java:45)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.lookup(RouteDefinitionRouteLocator.java:243)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.combinePredicates(RouteDefinitionRouteLocator.java:217)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.convertToRoute(RouteDefinitionRouteLocator.java:142)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator$$Lambda$395/564059141.apply(Unknown Source)


[2021-03-29 15:50:07.377][INFO] - thread id: 163, name: reactor-http-nio-3, state: RUNNABLE, lock:  
	java.lang.reflect.Array.get(Native Method)
	org.springframework.boot.context.properties.bind.Bindable.box(Bindable.java:254)
	org.springframework.boot.context.properties.bind.Bindable.of(Bindable.java:246)
	org.springframework.boot.context.properties.bind.JavaBeanBinder.bind(JavaBeanBinder.java:81)
	org.springframework.boot.context.properties.bind.JavaBeanBinder.bind(JavaBeanBinder.java:70)
	org.springframework.boot.context.properties.bind.JavaBeanBinder.bind(JavaBeanBinder.java:54)
	org.springframework.boot.context.properties.bind.Binder.lambda$null$4(Binder.java:329)
	org.springframework.boot.context.properties.bind.Binder$$Lambda$94/1847559273.apply(Unknown Source)
	java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1359)
	java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
	java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
	java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
	java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
	java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531)
	org.springframework.boot.context.properties.bind.Binder.lambda$bindBean$5(Binder.java:330)
	org.springframework.boot.context.properties.bind.Binder$$Lambda$93/425107133.get(Unknown Source)
	org.springframework.boot.context.properties.bind.Binder$Context.withIncreasedDepth(Binder.java:429)
	org.springframework.boot.context.properties.bind.Binder$Context.withBean(Binder.java:415)
	org.springframework.boot.context.properties.bind.Binder$Context.access$400(Binder.java:372)
	org.springframework.boot.context.properties.bind.Binder.bindBean(Binder.java:328)
	org.springframework.boot.context.properties.bind.Binder.bindObject(Binder.java:269)
	org.springframework.boot.context.properties.bind.Binder.bind(Binder.java:214)
	org.springframework.boot.context.properties.bind.Binder.bind(Binder.java:202)
	org.springframework.boot.context.properties.bind.Binder.bind(Binder.java:159)
	org.springframework.cloud.gateway.support.ConfigurationUtils.bind(ConfigurationUtils.java:47)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.lookup(RouteDefinitionRouteLocator.java:243)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.combinePredicates(RouteDefinitionRouteLocator.java:212)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator.convertToRoute(RouteDefinitionRouteLocator.java:142)
	org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator$$Lambda$395/564059141.apply(Unknown Source)

With a stack sample every 100 ms it is easy to see that reactor-http-nio-3 is not blocked at all (it stays RUNNABLE and waits on no lock) and keeps executing logic inside org.springframework.cloud.gateway.route.RouteDefinitionRouteLocator (every sample landing in that class means the thread spent another 100 ms there). That was baffling. Isn't refreshing and converting routes supposed to happen asynchronously? I still did not understand why parsing and converting routes is so expensive (the SCG code for it is actually quite involved and uses reflection, among other things), but in any case it should run asynchronously, so why is an I/O thread doing it? With that question we went into the SCG code and traced it to the route cache class CachingRouteLocator:

public class CachingRouteLocator
		implements RouteLocator, ApplicationListener<RefreshRoutesEvent> {

	private final RouteLocator delegate;

	private final Flux<Route> routes;

	private final Map<String, List> cache = new HashMap<>();

	public CachingRouteLocator(RouteLocator delegate) {
		this.delegate = delegate;
		// The CacheFlux helper is used to implement the caching behaviour
		routes = CacheFlux.lookup(cache, "routes", Route.class)
				.onCacheMissResume(() -> this.delegate.getRoutes()
						.sort(AnnotationAwareOrderComparator.INSTANCE));
	}

	@Override
	public Flux<Route> getRoutes() {
		return this.routes;
	}

	/**
	 * Clears the routes cache.
	 * @return routes flux
	 */
	public Flux<Route> refresh() {
		// This only clears the cache map; the next getRoutes() call triggers CacheFlux's
		// reload logic, which means under concurrency several threads may rebuild the cache
		this.cache.clear();
		return this.routes;
	}

	@Override
	public void onApplicationEvent(RefreshRoutesEvent event) {
		// Refresh the cache when the event arrives
		refresh();
	}

	@Deprecated
	/* for testing */ void handleRefresh() {
		refresh();
	}
}

This plainly makes the I/O thread itself parse and convert the route definitions and rebuild the cache, which is a design problem; making the cache refresh asynchronous should eliminate it, which led to the following change:

public class CachingRouteLocator
        implements RouteLocator, ApplicationListener<RefreshRoutesEvent> {

    private final RouteLocator delegate;

    private final AtomicReference<Map<String, List<Signal<Route>>>> cache = new AtomicReference<>(new HashMap<>());

    private static final String KEY = "routes";

    public CachingRouteLocator(RouteLocator delegate) {
        this.delegate = delegate;
        buildDematerialize(cache.get());
    }

    @Override
    public Flux<Route> getRoutes() {
        return Flux.defer(() -> Flux.fromIterable(cache.get().get(KEY)).dematerialize());
    }

    /**
     * Clears the routes cache.
     *
     * @return routes flux
     */
    public Flux<Route> refresh() {
        Map<String, List<Signal<Route>>> newCache = new HashMap<>();
        buildDematerialize(newCache);
        cache.set(newCache);
        return getRoutes();
    }

    /**
     * Based on CacheFlux#lookup
     */
    private void buildDematerialize(Map<String, List<Signal<Route>>> newCache) {
        Flux.defer(() -> this.delegate.getRoutes()
                .sort(AnnotationAwareOrderComparator.INSTANCE)
                .materialize()
                .collectList()
                .doOnNext(signals -> newCache.put(KEY, signals))
                .flatMapIterable(Function.identity())
                .dematerialize()).subscribe();
    }

    @Override
    public void onApplicationEvent(RefreshRoutesEvent event) {
        refresh();
    }

    @Deprecated
        /* for testing */ void handleRefresh() {
        refresh();
    }
}

This change essentially lifts the caching logic out of the CacheFlux helper and reworks it so that the cache is rebuilt by the thread that receives the refresh event, without bothering the I/O threads. The effect was immediate: even with all I/O threads busy, we never again saw them parsing and converting routes, that is, no more of those one-to-two-second windows in which the I/O threads handled neither requests nor responses.
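For context on what "the thread that receives the event" means: route refreshes are driven by publishing a RefreshRoutesEvent, and Spring delivers application events synchronously on the publishing thread by default, so with the patch the rebuild now runs on whichever thread publishes the event (typically a registry or config listener thread) rather than on a Netty I/O loop. A minimal sketch of such a trigger:

import org.springframework.cloud.gateway.event.RefreshRoutesEvent;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Component;

// Publishing a RefreshRoutesEvent is what makes CachingRouteLocator.onApplicationEvent()
// rebuild its cache; with the patch above that work stays on the publishing thread.
@Component
public class RouteRefresher {

    private final ApplicationEventPublisher publisher;

    public RouteRefresher(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    public void refreshRoutes() {
        publisher.publishEvent(new RefreshRoutesEvent(this));
    }
}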

I had intended to open a PR against SCG for this, but it turns out the problem is already gone in the latest version. Here is the current SCG implementation:

public class CachingRouteLocator
		implements Ordered, RouteLocator, ApplicationListener<RefreshRoutesEvent>, ApplicationEventPublisherAware {

	private static final Log log = LogFactory.getLog(CachingRouteLocator.class);

	private static final String CACHE_KEY = "routes";

	private final RouteLocator delegate;

	private final Flux<Route> routes;

	private final Map<String, List> cache = new ConcurrentHashMap<>();

	private ApplicationEventPublisher applicationEventPublisher;

	public CachingRouteLocator(RouteLocator delegate) {
		this.delegate = delegate;
		routes = CacheFlux.lookup(cache, CACHE_KEY, Route.class).onCacheMissResume(this::fetch);
	}

	private Flux<Route> fetch() {
		return this.delegate.getRoutes().sort(AnnotationAwareOrderComparator.INSTANCE);
	}

	@Override
	public Flux<Route> getRoutes() {
		return this.routes;
	}

	/**
	 * Clears the routes cache.
	 * @return routes flux
	 */
	public Flux<Route> refresh() {
		this.cache.clear();
		return this.routes;
	}

	@Override
	public void onApplicationEvent(RefreshRoutesEvent event) {
		try {
			fetch().collect(Collectors.toList()).subscribe(
					list -> Flux.fromIterable(list).materialize().collect(Collectors.toList()).subscribe(signals -> {
						applicationEventPublisher.publishEvent(new RefreshRoutesResultEvent(this));
						cache.put(CACHE_KEY, signals);
					}, this::handleRefreshError), this::handleRefreshError);
		}
		catch (Throwable e) {
			handleRefreshError(e);
		}
	}

	private void handleRefreshError(Throwable throwable) {
		if (log.isErrorEnabled()) {
			log.error("Refresh routes error !!!", throwable);
		}
		applicationEventPublisher.publishEvent(new RefreshRoutesResultEvent(this, throwable));
	}

	@Override
	public int getOrder() {
		return 0;
	}

	@Override
	public void setApplicationEventPublisher(ApplicationEventPublisher applicationEventPublisher) {
		this.applicationEventPublisher = applicationEventPublisher;
	}
}

Pitfall 4: Ribbon's lazy loading

This pitfall comes from Ribbon. After all the performance work above, some users still reported slow requests, and following the traceIds of those slow requests always turned up log lines like these:

[2021-04-11 17:26:33.621][INFO][reactor-http-nio-4] - Flipping property: XXX应用名称.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
[2021-04-11 17:26:33.622][INFO][reactor-http-nio-4] - Shutdown hook installed for: NFLoadBalancer-PingTimer-XXX应用名称
[2021-04-11 17:26:33.628][INFO][reactor-http-nio-4] - Flipping property: XXX应用名称.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647

Who exactly prints these lines? My first guess was Ribbon, so I pulled Ribbon's source and searched, and found nothing. In the end I had to add %c to the log4j2.xml pattern to print the logging class, which showed that they come from Ribbon's dependencies archaius-core (com.netflix.config.ChainedDynamicProperty) and netflix-commons-util (com.netflix.util.concurrent.ShutdownEnabledTimer). They also appear only once per downstream application, which seemed odd at first, until my brain finally got around to Ribbon's lazy loading. The fix is simply the following configuration:

ribbon:
  eager-load:
    enabled: true
    clients: app-1, app-2, app-3

However, the gateway has a lot of downstream applications, so to avoid slowing down SCG's startup we only list the core ones for now.
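If maintaining that list in YAML becomes unwieldy, the same warm-up can be done in code. The sketch below is only an illustration: it assumes Spring Cloud Netflix Ribbon is on the classpath and the service IDs are placeholders; it simply touches each client's load balancer at startup, which forces Ribbon to build the per-client child context eagerly instead of on the first request:

import java.util.Arrays;
import java.util.List;

import org.springframework.boot.ApplicationArguments;
import org.springframework.boot.ApplicationRunner;
import org.springframework.cloud.netflix.ribbon.SpringClientFactory;
import org.springframework.stereotype.Component;

@Component
public class RibbonWarmUpRunner implements ApplicationRunner {

    // Hypothetical service IDs; replace with the gateway's core downstream applications.
    private static final List<String> CORE_CLIENTS = Arrays.asList("app-1", "app-2", "app-3");

    private final SpringClientFactory springClientFactory;

    public RibbonWarmUpRunner(SpringClientFactory springClientFactory) {
        this.springClientFactory = springClientFactory;
    }

    @Override
    public void run(ApplicationArguments args) {
        for (String serviceId : CORE_CLIENTS) {
            // Looking up the load balancer creates the Ribbon child context for this client now,
            // so the first real request does not pay the initialization cost.
            springClientFactory.getLoadBalancer(serviceId);
        }
    }
}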

Pitfall 5: Direct (off-heap) memory leak

During a large business data synchronization, with large request bodies and a high request rate, the SCG gateway suddenly started reporting:

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 6257901575, max: 6274023424)
  at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:667) ~[netty-common-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:622) ~[netty-common-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:772) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:748) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:245) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.PoolArena.allocate(PoolArena.java:227) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.PoolArena.allocate(PoolArena.java:147) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:342) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115) ~[netty-buffer-4.1.36.Final.jar!/:4.1.36.Final]
  at org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:71) ~[spring-core-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
  at org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:39) ~[spring-core-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
  at org.springframework.core.codec.CharSequenceEncoder.lambda$encode$1(CharSequenceEncoder.java:85) ~[spring-core-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
  at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:107) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1643) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:241) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2205) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.MonoFlatMap$FlatMapInner.onSubscribe(MonoFlatMap.java:230) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.MonoJust.subscribe(MonoJust.java:54) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.MonoLiftFuseable.subscribe(MonoLiftFuseable.java:63) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:150) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.FluxContextStart$ContextStartSubscriber.onNext(FluxContextStart.java:103) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxFilterFuseable$FilterFuseableSubscriber.onNext(FluxFilterFuseable.java:113) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onNext(ScopePassingSpanSubscriber.java:96) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1643) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onComplete(MonoCollectList.java:121) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onComplete(ScopePassingSpanSubscriber.java:112) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:136) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onComplete(ScopePassingSpanSubscriber.java:112) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:252) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at org.springframework.cloud.sleuth.instrument.reactor.ScopePassingSpanSubscriber.onComplete(ScopePassingSpanSubscriber.java:112) ~[spring-cloud-sleuth-core-2.1.2.RELEASE.jar!/:2.1.2.RELEASE]
  at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:136) ~[reactor-core-3.2.22.RELEASE.jar!/:3.2.22.RELEASE]
  at reactor.netty.channel.FluxReceive.terminateReceiver(FluxReceive.java:426) ~[reactor-netty-0.8.23.RELEASE.jar!/:0.8.23.RELEASE]
  at reactor.netty.channel.FluxReceive.drainReceiver(FluxReceive.java:210) ~[reactor-netty-0.8.23.RELEASE.jar!/:0.8.23.RELEASE]
  at reactor.netty.channel.FluxReceive.onInboundComplete(FluxReceive.java:368) ~[reactor-netty-0.8.23.RELEASE.jar!/:0.8.23.RELEASE]
  at reactor.netty.channel.ChannelOperations.onInboundComplete(ChannelOperations.java:370) ~[reactor-netty-0.8.23.RELEASE.jar!/:0.8.23.RELEASE]
  at reactor.netty.http.server.HttpServerOperations.onInboundNext(HttpServerOperations.java:507) ~[reactor-netty-0.8.23.RELEASE.jar!/:0.8.23.RELEASE]
  at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:92) [reactor-netty-0.8.23.RELEASE.jar!/:0.8.23.RELEASE]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at reactor.netty.http.server.HttpTrafficHandler.channelRead(HttpTrafficHandler.java:214) [reactor-netty-0.8.23.RELEASE.jar!/:0.8.23.RELEASE]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar!/:4.1.36.Final]
  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.36.Final.jar!/:4.1.36.Final]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]

You can see the allocation of about 16 MB of direct memory failing because the container's roughly 6 GB direct-memory budget was almost exhausted. My first instinct was that this was not a plain overflow but a leak, possibly yet another old-version issue, so I searched around and found an SCG issue that looked a lot like mine: https://github.com/spring-cloud/spring-cloud-gateway/issues/2090. It turned out not to be SCG's fault at all but a problem in the global Spring Web exception handler I had written. The original handler:

@Order(-1)
@Component
public class GatewayErrorWebExceptionHandler implements ErrorWebExceptionHandler {

    private Logger appLogger = LoggerFactory.getLogger(GatewayErrorWebExceptionHandler.class);
    private Logger warnLogger = LoggerFactory.getLogger("asyncWarnLogger");


    @Override
    public Mono<Void> handle(ServerWebExchange exchange, Throwable ex) {
        ServerHttpResponse response = exchange.getResponse();

        if (response.isCommitted()) {
            warnLogger.error("GatewayErrorWebExceptionHandler#handle response.isCommitted(), traceId:{}, header:{}, " +
                            "queryParams:{}", exchange.getAttribute(REQUEST_TRACEID_ATTRIBUTE),
                    exchange.getRequest().getHeaders(), exchange.getRequest().getQueryParams(), ex);
            return Mono.error(ex);
        }

        // Exceptions from calling downstream services must not be returned to the caller as-is
        response.setStatusCode(HttpStatus.OK);
        return response.writeWith(Mono.fromSupplier(() -> {
            DataBufferFactory bufferFactory = response.bufferFactory();
            try {

                HttpResult exceptionHttpResult = null;
                boolean printWarnLog = true;
                try {
                    if (ex instanceof GatewayException) {
                        GatewayException gatewayException = (GatewayException) ex;
                        exceptionHttpResult = HttpResult.build(null, gatewayException.getCode(),
                                gatewayException.getMessage());
                        return bufferFactory.wrap(JSON.toJSONBytes(exceptionHttpResult));

                    } else if (ex instanceof ResponseStatusException) {
                        ResponseStatusException responseStatusException = (ResponseStatusException) ex;
                        String extInfo = StringUtils.isBlank(responseStatusException.getReason()) ?
                                responseStatusException.getMessage() : responseStatusException.getReason();
                        exceptionHttpResult = HttpResult.build(null,
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getCode(),
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getDescription() + ", " + extInfo);
                        return bufferFactory.wrap(JSON.toJSONBytes(exceptionHttpResult));

                    } else {
                        printWarnLog = false;
                        exceptionHttpResult = HttpResult.build(null,
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getCode(),
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getDescription() + ", " + ex.getMessage());
                        return bufferFactory.wrap(JSON.toJSONBytes(exceptionHttpResult));
                    }

                } finally {
                    RequestPath reqPath = exchange.getRequest().getPath();
                    String path = reqPath != null ? reqPath.toString() : null;

                    MultiValueMap<String, String> queryParams = exchange.getRequest().getQueryParams();
                    Long reqStartNanoTime = exchange.getAttribute(REQUEST_START_NANO_TIME_ATTRIBUTE);
                    StringBuilder logBuilder = new StringBuilder("Invoke fail,traceId:")
                            .append((String) exchange.getAttribute(REQUEST_TRACEID_ATTRIBUTE))
                            .append(", path:")
                            .append(path)
                            .append(",headers:")
                            .append(JSON.toJSONString(exchange.getRequest().getHeaders()))
                            .append(", gateway应答:").append(exceptionHttpResult)
                            .append(", costTime:").append(reqStartNanoTime == null ?
                                    "计算失败" : (System.nanoTime() - reqStartNanoTime.longValue()) / 1000000);
                    String logContext = logBuilder.toString();

                    // Split the log output: app.log omits the stack trace; the warn/error log carries it
                    appLogger.warn(logContext);
                    if (printWarnLog) {
                        warnLogger.warn(logContext, ex);
                    } else {
                        warnLogger.error(logContext, ex);
                    }
                    
                }
            } catch (Throwable e) {
                warnLogger.error("GatewayErrorWebExceptionHandler#handle Error writing response, traceId:" +
                        exchange.getAttribute(REQUEST_TRACEID_ATTRIBUTE), ex);
                return bufferFactory.wrap(new byte[0]);
            }
        }));
    }
}

The modified code:

@Order(-1)
@Component
public class GatewayErrorWebExceptionHandler implements ErrorWebExceptionHandler {

    private Logger appLogger = LoggerFactory.getLogger(GatewayErrorWebExceptionHandler.class);
    private Logger warnLogger = LoggerFactory.getLogger("asyncWarnLogger");


    @Override
    public Mono<Void> handle(ServerWebExchange exchange, Throwable ex) {
        ServerHttpResponse response = exchange.getResponse();

        if (response.isCommitted()) {
            warnLogger.error("GatewayErrorWebExceptionHandler#handle response.isCommitted(), traceId:{}, header:{}, " +
                            "queryParams:{}", exchange.getAttribute(REQUEST_TRACEID_ATTRIBUTE),
                    exchange.getRequest().getHeaders(), exchange.getRequest().getQueryParams(), ex);
            return Mono.error(ex);
        }

        // Exceptions from calling downstream services must not be returned to the caller as-is
        response.setStatusCode(HttpStatus.OK);
        return response.writeWith(Mono.fromSupplier(() -> {
            DataBufferFactory bufferFactory = response.bufferFactory();
            try {

                HttpResult exceptionHttpResult = null;
                boolean printWarnLog = true;
                try {
                    if (ex instanceof GatewayException) {
                        GatewayException gatewayException = (GatewayException) ex;
                        exceptionHttpResult = HttpResult.build(null, gatewayException.getCode(),
                                gatewayException.getMessage());
                        return getDataBuffer(bufferFactory, exceptionHttpResult);

                    } else if (ex instanceof ResponseStatusException) {
                        ResponseStatusException responseStatusException = (ResponseStatusException) ex;
                        String extInfo = StringUtils.isBlank(responseStatusException.getReason()) ?
                                responseStatusException.getMessage() : responseStatusException.getReason();
                        exceptionHttpResult = HttpResult.build(null,
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getCode(),
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getDescription() + ", " + extInfo);

                        return getDataBuffer(bufferFactory, exceptionHttpResult);

                    } else {
                        printWarnLog = false;
                        exceptionHttpResult = HttpResult.build(null,
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getCode(),
                                OPEN_PLATFORM_GATEWAY_INVOKE_ERROR.getDescription() + ", " + ex.getMessage());

                        return getDataBuffer(bufferFactory, exceptionHttpResult);
                    }

                } finally {
                    RequestPath reqPath = exchange.getRequest().getPath();
                    String path = reqPath != null ? reqPath.toString() : null;

                    MultiValueMap<String, String> queryParams = exchange.getRequest().getQueryParams();
                    Long reqStartNanoTime = exchange.getAttribute(REQUEST_START_NANO_TIME_ATTRIBUTE);
                    StringBuilder logBuilder = new StringBuilder("Invoke fail,traceId:")
                            .append((String) exchange.getAttribute(REQUEST_TRACEID_ATTRIBUTE))
                            .append(", path:")
                            .append(path)
                            .append(",headers:")
                            .append(JSON.toJSONString(exchange.getRequest().getHeaders()))
                            .append(", gateway应答:").append(exceptionHttpResult)
                            .append(", costTime:").append(reqStartNanoTime == null ?
                                    "计算失败" : (System.nanoTime() - reqStartNanoTime.longValue()) / 1000000);
                    String logContext = logBuilder.toString();

                    // Split the log output: app.log omits the stack trace; the warn/error log carries it
                    appLogger.warn(logContext);
                    ActiveSpan.tag("gateway应答", exceptionHttpResult.toString());
                    if (printWarnLog) {
                        warnLogger.warn(logContext, ex);
                    } else {
                        warnLogger.error(logContext, ex);
                    }
                }
            } catch (Throwable e) {
                warnLogger.error("GatewayErrorWebExceptionHandler#handle Error writing response, traceId:" +
                        exchange.getAttribute(REQUEST_TRACEID_ATTRIBUTE), ex);
                return bufferFactory.allocateBuffer(0).write(new byte[0]);
            }
        }));
    }

    /**
     * Attempt to work around the direct memory leak
     *
     * https://github.com/spring-cloud/spring-cloud-gateway/issues/2090
     *
     * @param bufferFactory
     * @param exceptionHttpResult
     * @return
     */
    @NotNull
    private DataBuffer getDataBuffer(DataBufferFactory bufferFactory, HttpResult exceptionHttpResult) {
        byte[] jsonBytes = JSON.toJSONBytes(exceptionHttpResult);
        return bufferFactory.allocateBuffer(jsonBytes.length).write(jsonBytes);
    }
}

Since this change we have not seen the problem again.
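A cheap safety net for this class of problem is to log Netty's direct-memory counter periodically, so a slow leak shows up as a steadily climbing number long before the OutOfDirectMemoryError. The sketch below reads Netty's internal PlatformDependent counters; they return -1 when the counter is not being tracked, so treat the output as a hint rather than an exact measurement:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import javax.annotation.PostConstruct;

import io.netty.util.internal.PlatformDependent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class DirectMemoryMonitor {

    private static final Logger log = LoggerFactory.getLogger(DirectMemoryMonitor.class);

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "direct-memory-monitor"));

    @PostConstruct
    public void start() {
        // usedDirectMemory() reflects Netty's own allocation counter, i.e. the numbers that
        // appear in the OutOfDirectMemoryError message above.
        scheduler.scheduleWithFixedDelay(() -> log.info("Netty direct memory used={} max={}",
                PlatformDependent.usedDirectMemory(), PlatformDependent.maxDirectMemory()),
                30, 30, TimeUnit.SECONDS);
    }
}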

Pitfall 6: The 200 QPS ceiling we couldn't break

Believe it or not, just before the SCG gateway went live we load-tested it again while stress-testing the downstream business modules, and QA reported that the throughput simply would not go up: 8 SCG instances, just over 1,000 QPS in total, less than 200 QPS per instance! We could not believe it, so I ran a local test myself: 168 QPS, even though the downstream endpoint returned immediately without doing any business work (to rule the downstream systems out as much as possible). When I built a Dubbo gateway in the past, a single instance could do at least two thousand QPS. Was this the drawback of using HTTP for RPC finally showing? So I took another careful look at the gateway logs:

[2021-04-16 13:23:08.212][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588212452
[2021-04-16 13:23:08.258][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588212452
[2021-04-16 13:23:08.260][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588212452
[2021-04-16 13:23:08.260][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588212452
[2021-04-16 13:23:08.262][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588262147
[2021-04-16 13:23:08.285][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588262147
[2021-04-16 13:23:08.286][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588262147
[2021-04-16 13:23:08.286][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588262147
[2021-04-16 13:23:08.287][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588287446
[2021-04-16 13:23:08.336][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588287446
[2021-04-16 13:23:08.336][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588287446
[2021-04-16 13:23:08.337][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588287446
[2021-04-16 13:23:08.338][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588338801
[2021-04-16 13:23:08.384][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588338801
[2021-04-16 13:23:08.385][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588338801
[2021-04-16 13:23:08.385][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588338801
[2021-04-16 13:23:08.386][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588386838
[2021-04-16 13:23:08.412][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588386838
[2021-04-16 13:23:08.412][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588386838
[2021-04-16 13:23:08.412][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588386838
[2021-04-16 13:23:08.414][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588413651
[2021-04-16 13:23:08.462][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588413651
[2021-04-16 13:23:08.462][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588413651
[2021-04-16 13:23:08.463][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588413651
[2021-04-16 13:23:08.464][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588464333
[2021-04-16 13:23:08.504][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588464333
[2021-04-16 13:23:08.505][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588464333
[2021-04-16 13:23:08.505][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588464333
[2021-04-16 13:23:08.506][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588506374
[2021-04-16 13:23:08.537][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588506374
[2021-04-16 13:23:08.538][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588506374
[2021-04-16 13:23:08.538][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588506374
[2021-04-16 13:23:08.540][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588539761
[2021-04-16 13:23:08.587][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588539761
[2021-04-16 13:23:08.588][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588539761
[2021-04-16 13:23:08.588][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588539761
[2021-04-16 13:23:08.591][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588589373
[2021-04-16 13:23:08.632][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588589373
[2021-04-16 13:23:08.633][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588589373
[2021-04-16 13:23:08.633][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588589373
[2021-04-16 13:23:08.634][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588634798
[2021-04-16 13:23:08.666][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588634798
[2021-04-16 13:23:08.666][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588634798
[2021-04-16 13:23:08.666][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588634798
[2021-04-16 13:23:08.668][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588668274
[2021-04-16 13:23:08.715][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588668274
[2021-04-16 13:23:08.716][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588668274
[2021-04-16 13:23:08.716][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588668274
[2021-04-16 13:23:08.717][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587586243
[2021-04-16 13:23:08.718][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587745906
[2021-04-16 13:23:08.718][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587613280
[2021-04-16 13:23:08.718][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587927424
[2021-04-16 13:23:08.718][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587878094
[2021-04-16 13:23:08.719][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587798467
[2021-04-16 13:23:08.719][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587666680
[2021-04-16 13:23:08.719][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587953401
[2021-04-16 13:23:08.719][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587691449
[2021-04-16 13:23:08.719][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550587823334
[2021-04-16 13:23:08.720][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588029990
[2021-04-16 13:23:08.720][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588084089
[2021-04-16 13:23:08.720][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588004409
[2021-04-16 13:23:08.720][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588262147
[2021-04-16 13:23:08.721][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588506374
[2021-04-16 13:23:08.725][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588725826
[2021-04-16 13:23:08.794][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588539761
[2021-04-16 13:23:08.796][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588796786
[2021-04-16 13:23:08.844][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588796786
[2021-04-16 13:23:08.845][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588796786
[2021-04-16 13:23:08.845][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588796786
[2021-04-16 13:23:08.846][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588846674
[2021-04-16 13:23:08.871][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588846674
[2021-04-16 13:23:08.871][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588846674
[2021-04-16 13:23:08.871][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588846674
[2021-04-16 13:23:08.873][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588873692
[2021-04-16 13:23:08.921][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588873692
[2021-04-16 13:23:08.922][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588873692
[2021-04-16 13:23:08.922][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588873692
[2021-04-16 13:23:08.923][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588923397
[2021-04-16 13:23:08.965][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588923397
[2021-04-16 13:23:08.966][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588923397
[2021-04-16 13:23:08.966][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588923397
[2021-04-16 13:23:08.967][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588967944
[2021-04-16 13:23:08.997][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588967944
[2021-04-16 13:23:08.998][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588967944
[2021-04-16 13:23:08.998][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588967944
[2021-04-16 13:23:08.999][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550588999795
[2021-04-16 13:23:09.049][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550588999795
[2021-04-16 13:23:09.049][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550588999795
[2021-04-16 13:23:09.049][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588999795
[2021-04-16 13:23:09.051][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550589051448
[2021-04-16 13:23:09.090][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550589051448
[2021-04-16 13:23:09.091][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550589051448
[2021-04-16 13:23:09.091][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550589051448
[2021-04-16 13:23:09.093][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550589093111
[2021-04-16 13:23:09.123][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550589093111
[2021-04-16 13:23:09.124][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550589093111
[2021-04-16 13:23:09.124][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550589093111
[2021-04-16 13:23:09.125][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550589125442
[2021-04-16 13:23:09.171][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550589125442
[2021-04-16 13:23:09.171][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550589125442
[2021-04-16 13:23:09.172][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550589125442
[2021-04-16 13:23:09.173][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550589173054
[2021-04-16 13:23:09.198][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550589173054
[2021-04-16 13:23:09.199][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550589173054
[2021-04-16 13:23:09.199][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550589173054
[2021-04-16 13:23:09.200][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550589200780
[2021-04-16 13:23:09.248][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550589200780
[2021-04-16 13:23:09.249][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550589200780
[2021-04-16 13:23:09.249][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550589200780
[2021-04-16 13:23:09.250][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550589250680
[2021-04-16 13:23:09.291][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550589250680
[2021-04-16 13:23:09.292][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550589250680
[2021-04-16 13:23:09.292][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550589250680
[2021-04-16 13:23:09.293][INFO][reactor-http-nio-3] - Gateway receive request, traceId:1618550589293793
[2021-04-16 13:23:09.321][INFO][reactor-http-nio-3] - ApiCheckGlobalFilter#filter traceId:1618550589293793
[2021-04-16 13:23:09.322][INFO][reactor-http-nio-3] - Gateway loadbalance, traceId:1618550589293793
[2021-04-16 13:23:09.322][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550589293793
[2021-04-16 13:23:09.323][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588160623
[2021-04-16 13:23:09.323][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588212452

This is the log of a single thread (reactor-http-nio-3) over a short period. Take the request with traceId:1618550588212452 (the first four lines and the last line of the log above): it has five entries. The first is the gateway receiving the request; the last ([2021-04-16 13:23:09.323][INFO][reactor-http-nio-3] - 【 Invoke Success 】 traceId:1618550588212452) is printed when the gateway receives the successful response from the downstream service; the fourth ([2021-04-16 13:23:08.260][INFO][reactor-http-nio-3] - Gateway send request, traceId:1618550588212452) is when the gateway sends the request downstream. So although more than a second passes between the fourth entry (gateway calls downstream) and the last one (gateway receives the downstream response), the thread handling this request was not idle during that second: it kept processing other requests and responses, which means it was not obviously blocked or waiting on a lock. SkyWalking showed that the downstream call itself took only a few milliseconds, yet on the gateway side a whole second elapsed between sending the request and receiving the response. Was it a reactor-netty problem? Was our SCG version simply too old? So I spent some time building a stripped-down gateway on the latest SCG (without our scaffolding and without TSF), configured a single API routed directly to the downstream business system, deployed it locally and ran the load test. The result was satisfying: a solid 1,500 QPS! Could upgrading SCG really bring a 10x performance improvement? Hard to believe. Or was the leap due to dropping the scaffolding and TSF? Also unlikely: the business systems use the same scaffolding and TSF, and their performance is nowhere near this bad.

Could it be... the number of APIs? So on the original SCG gateway we deleted all the other APIs, kept just one, deployed locally and ran the load test again: a solid 400 QPS, much better than the earlier 168 QPS!

This was an awkward conclusion indeed, so I also faked 3,000 APIs (i.e. 3,000 route rules) on the latest SCG, and... the local benchmark of the new gateway dropped from 1,500 QPS to 180 QPS. That settled it: when SCG has to resolve thousands or tens of thousands of routes (predicates), it suffers a severe performance problem, and both the old and the new versions are affected!
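
If you want to reproduce the effect on a clean gateway, it is enough to register a few thousand throwaway routes and load-test any one of them. The following is only a sketch; the path prefix, backend URI and route count are made up for illustration and are not our actual configuration:

import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DummyRouteConfig {

    /**
     * Illustrative only: register a few thousand throwaway routes so that the predicate scan
     * in RoutePredicateHandlerMapping#lookupRoute becomes measurable under load.
     * The "/dummy" prefix and "http://localhost:8081" backend are hypothetical.
     */
    @Bean
    public RouteLocator dummyRoutes(RouteLocatorBuilder builder) {
        RouteLocatorBuilder.Builder routes = builder.routes();
        for (int i = 0; i < 3000; i++) {
            final int n = i;
            routes = routes.route("dummy-route-" + n,
                    r -> r.path("/dummy/" + n + "/**").uri("http://localhost:8081"));
        }
        return routes.build();
    }
}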

So how do we optimize it? Where exactly does route resolution happen? Digging through the source, it turns out to be in org.springframework.cloud.gateway.handler.RoutePredicateHandlerMapping#lookupRoute. RoutePredicateHandlerMapping extends AbstractHandlerMapping, which already belongs to the Spring WebFlux/Spring Web layer rather than to SCG itself; looking at DispatcherHandler#handle makes the dispatch flow clear:

	@Override
	public Mono<Void> handle(ServerWebExchange exchange) {
		if (this.handlerMappings == null) {
			return createNotFoundError();
		}
		return Flux.fromIterable(this.handlerMappings)
				.concatMap(mapping -> mapping.getHandler(exchange))
				// next() is the key: only the first handler emitted is used
				.next()
				.switchIfEmpty(createNotFoundError())
				.flatMap(handler -> invokeHandler(exchange, handler))
				.flatMap(result -> handleResult(exchange, result));
	}

Now look at the implementation of RoutePredicateHandlerMapping#lookupRoute:

	protected Mono<Route> lookupRoute(ServerWebExchange exchange) {
		return this.routeLocator.getRoutes()
				// individually filter routes so that filterWhen error delaying is not a
				// problem
				.concatMap(route -> Mono.just(route).filterWhen(r -> {
					// add the current route we are testing
					exchange.getAttributes().put(GATEWAY_PREDICATE_ROUTE_ATTR, r.getId());
					return r.getPredicate().apply(exchange);
				})
						// instead of immediately stopping main flux due to error, log and
						// swallow it
						.doOnError(e -> logger.error(
								"Error applying predicate for route: " + route.getId(),
								e))
						.onErrorResume(e -> Mono.empty()))
				// .defaultIfEmpty() put a static Route not found
				// or .switchIfEmpty()
				// .switchIfEmpty(Mono.empty().log("noroute"))
				.next()
				// TODO: error handling
				.map(route -> {
					if (logger.isDebugEnabled()) {
						logger.debug("Route matched: " + route.getId());
					}
					validateRoute(route, exchange);
					return route;
				});

	}

The core of the method is to obtain one Route (a Mono) from the route list (this.routeLocator.getRoutes()): it walks through all route definitions, evaluating each predicate in turn until one matches. With thousands or tens of thousands of routes, of course this is slow!
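
To make the cost concrete, here is a toy, self-contained comparison (not SCG code; the route shapes and counts are invented for illustration) of scanning N predicates in order versus a single map lookup keyed by an API identifier:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RouteScanVsMapDemo {

    public static void main(String[] args) {
        int n = 3000;
        List<Predicate<String>> predicates = new ArrayList<>();
        Map<String, String> routesById = new HashMap<>();
        for (int i = 0; i < n; i++) {
            String prefix = "/api/" + i + "/";
            predicates.add(path -> path.startsWith(prefix));   // one "route predicate" per API
            routesById.put("api-" + i, "route-" + i);          // id -> route lookup table
        }

        String path = "/api/" + (n - 1) + "/hello";            // worst case: only the last predicate matches

        long t0 = System.nanoTime();
        String scanned = null;
        for (int i = 0; i < predicates.size(); i++) {
            if (predicates.get(i).test(path)) {
                scanned = "route-" + i;
                break;
            }
        }
        long t1 = System.nanoTime();

        String byId = routesById.get("api-" + (n - 1));
        long t2 = System.nanoTime();

        System.out.printf("scan found %s in %d ns, map found %s in %d ns%n",
                scanned, t1 - t0, byId, t2 - t1);
    }
}

In the worst case the scan evaluates every predicate for every request, and in SCG that work runs on the event-loop thread that handles all requests, so the per-request cost translates directly into lost QPS.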

Given how our SCG gateway models APIs, a route corresponds one-to-one with an API, so once we know the API we effectively already know the route, and there is no need to go through SCG's powerful but comparatively expensive predicate-based route resolution. Why not build a Map while constructing the route information, keyed by route ID? The ID is customizable, so we can use the API's unique identifier as the route ID, with the Route itself as the value. When a request arrives, our protocol lets us parse out the API; with the API in hand we can fetch the corresponding Route from the Map and bypass the original logic of RoutePredicateHandlerMapping#lookupRoute. As long as we can resolve the API before RoutePredicateHandlerMapping#lookupRoute runs and then take the Route from the prepared Map, the original lookup is skipped. That is how ProtocolHandlerMapping came about; the code is as follows:

@Component
public class ProtocolHandlerMapping extends AbstractHandlerMapping {

    @Autowired
    private GatewayExtPointManager gatewayExtPointManager;

    public ProtocolHandlerMapping() {
        /**
         * Make sure this runs before {@link RoutePredicateHandlerMapping}
         */
        setOrder(0);
    }

    @Override
    protected Mono<?> getHandlerInternal(ServerWebExchange exchange) {

        // Parse the Api (apiBean is non-null)
        ApiBean apiBean = 【parse the Api from the exchange, code omitted】;
        // Store the Api information in the exchange attributes
        exchange.getAttributes().put(API_NAME_ATTRIBUTE, apiBean.getApiName());
        exchange.getAttributes().put(API_VERSION_ATTRIBUTE, apiBean.getApiVersion());


        // Note: return empty here. Following the DispatcherHandler logic, only the first
        // handler is used, so returning a non-empty value would prevent the
        // {@link RoutePredicateHandlerMapping} from ever running.
        return Mono.empty();
    }
}

As you can see, ProtocolHandlerMapping simply parses the API out of the request and puts the API information into the exchange context. But how do we change the original logic of RoutePredicateHandlerMapping#lookupRoute? I forgot to mention: we maintain our own copy of the SCG source, so we can modify the source directly and build our own artifact for the gateway. The modified RoutePredicateHandlerMapping#lookupRoute looks like this:

    /**
     * Note: with a very large number of Routes (thousands or tens of thousands) this method has a
     * severe performance problem, so together with the modified CachingRouteLocator we turn the
     * predicate matching into a Map get. This optimization only applies to our SCG gateway.
     *
     * @param exchange
     * @return
     */
    protected Mono<Route> lookupRoute(ServerWebExchange exchange) {

        /**
         * Performance shortcut tailored to our SCG gateway: resolve the route directly by the
         * API's unique identifier.
         */
        String sropApiNameAttribute = exchange.getAttribute("sropApiNameAttribute");
        String sropApiVersionAttribute = exchange.getAttribute("sropApiVersionAttribute");
        if (!StringUtils.isEmpty(sropApiNameAttribute) && !StringUtils.isEmpty(sropApiVersionAttribute)
                && this.routeLocator instanceof CachingRouteLocator) {
            CachingRouteLocator cachingRouteLocator = (CachingRouteLocator) this.routeLocator;
            // Note: getRouteMap() is also a newly added method
            return cachingRouteLocator.getRouteMap().next()
                    .flatMap(map -> Mono.justOrEmpty(map.get(sropApiNameAttribute + "#" + sropApiVersionAttribute)))
                    // if nothing is found in the map, fall back to the original (official) matching logic
                    .switchIfEmpty(matchRoute(exchange));
        }

        return matchRoute(exchange);
    }

    /**
     * The original (official) matching logic, extracted into its own method
     *
     * @param exchange
     * @return
     */
    private Mono<Route> matchRoute(ServerWebExchange exchange) {
        return this.routeLocator.getRoutes()
                // individually filter routes so that filterWhen error delaying is not a
                // problem
                .concatMap(route -> Mono.just(route).filterWhen(r -> {
                    // add the current route we are testing
                    exchange.getAttributes().put(GATEWAY_PREDICATE_ROUTE_ATTR, r.getId());
                    return r.getPredicate().apply(exchange);
                })
                        // instead of immediately stopping main flux due to error, log and
                        // swallow it
                        .doOnError(e -> logger.error(
                                "Error applying predicate for route: " + route.getId(),
                                e))
                        .onErrorResume(e -> Mono.empty()))
                // .defaultIfEmpty() put a static Route not found
                // or .switchIfEmpty()
                // .switchIfEmpty(Mono.empty().log("noroute"))
                .next()
                // TODO: error handling
                .map(route -> {
                    if (logger.isDebugEnabled()) {
                        logger.debug("Route matched: " + route.getId());
                    }
                    validateRoute(route, exchange);
                    return route;
                });

    }

As the code shows, the original logic is now only a fallback: if the Route for the API cannot be found in the Map, we still go through the original predicate matching. As mentioned above, we also added a getRouteMap method to CachingRouteLocator, because the API-to-Route Map has to be built inside CachingRouteLocator. The modified CachingRouteLocator is as follows:

public class CachingRouteLocator
        implements RouteLocator, ApplicationListener<RefreshRoutesEvent> {

    private final RouteLocator delegate;

    private static final String KEY = "routes";

    private final AtomicReference<Map<String, List<Signal<Route>>>> cache = new AtomicReference<>(new HashMap<>());

    /**
     * Cache introduced to avoid the gateway-wide slowdown that RoutePredicateHandlerMapping#lookupRoute
     * causes when there are very many routes.
     * key: apiName#apiVersion (i.e. the Route id), value: the Route
     */
    private final AtomicReference<Map<String, Route>> apiCache = new AtomicReference<>(new HashMap<>());

    public CachingRouteLocator(RouteLocator delegate) {
        this.delegate = delegate;
        buildDematerialize(cache.get(), apiCache.get());
    }

    @Override
    public Flux<Route> getRoutes() {
        return Flux.defer(() -> Flux.fromIterable(cache.get().get(KEY)).<Route>dematerialize());
    }

    /**
     * New accessor that lets callers avoid the performance problem getRoutes() causes for
     * RoutePredicateHandlerMapping#lookupRoute when there are too many routes.
     *
     * @return
     */
    public Flux<Map<String, Route>> getRouteMap() {
        return Flux.just(apiCache.get());
    }

    /**
     * Clears the routes cache.
     *
     * @return routes flux
     */
    public Flux<Route> refresh() {
        Map<String, List<Signal<Route>>> newCache = new HashMap<>();
        Map<String, Route> newApiCache = new HashMap<>();
        buildDematerialize(newCache, newApiCache);
        apiCache.set(newApiCache);
        cache.set(newCache);

        return getRoutes();
    }

    /**
     * Modeled on CacheFlux#lookup
     *
     * @return
     */
    private void buildDematerialize(Map<String, List<Signal<Route>>> newCache, Map<String, Route> newApiCache) {
        Flux.defer(() -> this.delegate.getRoutes()
                .sort(AnnotationAwareOrderComparator.INSTANCE)
                .materialize()
                .collectList()
                .doOnNext(signals -> newCache.put(KEY, signals))
                .flatMapIterable(Function.identity())
                .<Route>dematerialize())
                // populate the id-keyed cache from the same pass so apiCache stays in sync with cache
                .subscribe(route -> newApiCache.put(route.getId(), route));
    }

    @Override
    public void onApplicationEvent(RefreshRoutesEvent event) {
        refresh();
    }

    @Deprecated
        /* for testing */ void handleRefresh() {
        refresh();
    }

}

In short, we added the apiCache field, which is exactly the Map we wanted; all that matters is to keep apiCache updated in lockstep with the existing cache!
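
As a quick sanity check, the new map can be queried directly after forcing a refresh. This is only a sketch, assuming it sits in the same package as the modified CachingRouteLocator; the route id "user.get#1.0" is a hypothetical apiName#apiVersion key:

import org.springframework.cloud.gateway.event.RefreshRoutesEvent;
import reactor.core.publisher.Mono;

public class ApiCacheSmokeCheck {

    /**
     * Sketch only: after firing a refresh, the id-keyed cache should resolve a route directly.
     */
    static void check(CachingRouteLocator locator) {
        locator.onApplicationEvent(new RefreshRoutesEvent("smoke-check"));
        locator.getRouteMap()
                .next()
                .flatMap(map -> Mono.justOrEmpty(map.get("user.get#1.0")))
                .subscribe(
                        route -> System.out.println("cached route -> " + route.getUri()),
                        err -> System.err.println("lookup failed: " + err),
                        () -> System.out.println("no cached route for this id"));
    }
}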

After the change, a load test in the production environment took a single SCG gateway instance from 200 QPS to 2,400 QPS!

Once I realized other teams were running into the same problem, I opened a PR against Spring Cloud Gateway; for various reasons it was ultimately not accepted. If you are interested: https://github.com/spring-cloud/spring-cloud-gateway/pull/2230

三、Summary

Because SCG and Spring WebFlux are built on the Reactor library, there is a real learning and usage cost, and SCG itself is still iterating rapidly, as are the underlying reactor-netty and reactor-core, so use the latest version whenever you can. Here's hoping SCG keeps getting better!
