应用部署引起上游服务抖动问题分析及优化实践方案

作者:京东物流 朱永昌

背景介绍

本文主要围绕应用部署引起上游服务抖动问题展开,结合百川分流系统实例,提供分析、解决思路,并提供一套切实可行的实践方案。

百川分流系统作为交易订单中心的专用网关,为交易订单中心提供统一的对外标准服务(包括接单、修改、取消、回传等),对内则基于配置规则将流量分发到不同业务线的应用上。随着越来越多的流量切入百川系统,因系统部署引起服务抖动导致上游系统调用超时的问题也逐渐凸显出来。为提供稳定的交易服务系统,提升系统可用率,需要对该问题进行优化。

经调研,集团内部现有两种预热方案:

(1)JSF官方提供的预热方案;

(2)行云编排部署结合录制回放的预热方案。两种方法均无法达到预期效果。

关于方案

(1)首先,使用的前提条件是JSF消费端必需升级JSF版本到1.7.6,百川分流系统上游调用方有几十个,推动所有调用方升级版本比较困难;其次,JSF平台预热规则以接口纬度进行配置,百川分流系统对外提供46个接口,配置复杂;最关键的是该方案的预热规则配置的是在一个固定预热周期(比如1分钟)内某个接口的预热权重(接收调用量比例),简单理解就是小流量试跑,这就决定了该方案无法对系统资源进行充分预热,预热周期过后全部流量进入依然会因需要创建或初始化资源引起服务抖动,对于交易接单服务来说,抖动就会导致接单失败,有卡单风险。

关于方案

(2)通过录制线上流量进行压测回放来实现预热,适合读接口,但对于写接口如果不做特殊处理会影响线上数据;针对这个问题,目前的解决方案是通过压测标识来识别压测预热流量,但交易业务逻辑复杂,下游依赖繁多,相关系统目前并不支持。单独改造的话,接口多、风险高。

基于以上情况,我们通过百川分流系统部署引起上游服务抖动这个实例,追踪其表象线索,深入研读JSF源码,最终找到导致服务抖动的关键因素,开发了一套更加有效的预热方案,验证结果表明该方案预热效果明显,服务调用方方法性能MAX值降低90%,降到了超时时间范围内,消除了因机器部署引起上游调用超时的问题。

问题现象

系统上线部署期间,纯配接单服务上游调用方反馈接单服务抖动,出现调用超时现象。

查看此服务UMP打点,发现此服务的方法性能监控MAX值最大3073ms,未超过调用方设置的超时时间10000ms(如图1所示)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-TrZZSKH7-1681440619079)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/8a9764a9b6e04f3c9fc513084c4b690b~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=89xmyIwIf6yph54eMrwZChUHG5Q%3D)]

图1 服务内部监控打点

查看此服务PFinder性能监控,发现上游调用方应用调用此服务的方法性能监控MAX值多次超过10000ms(可以直接查看调用方的UMP打点,若调用方无法提供UMP打点时,也可借助PFinder的应用拓扑功能进行查看,如图2所示)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-DZteODLj-1681440619082)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/aa10c8e3e01345b38b04c88bd508c556~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=F1ikcDXFsg41H4j2SsQZH%2FJThVw%3D)]

图2 服务外部监控打点

分析思路

从上述问题现象可以看出,在系统上线部署期间服务提供方接口性能MAX值并无明显抖动,但服务调用方接口性能MAX值抖动明显。由此,可以确定耗时不在服务提供方内部处理逻辑上,而是在进入服务提供方内部处理逻辑之前(或者之后),那么在之前或者之后具体都经历了什么呢?我们不着急回答这个问题,先基于现有的一些线索逐步进行追踪探索。

线索一:部署过程中机器CPU会有短暂飙升(如图3所示)

如果此时有请求调用到当前机器,接口性能势必会受到影响。因此,考虑机器部署完成且待机器CPU平稳后再上线JSF服务,这可以通过调整JSF延迟发布参数来实现。具体配置如下:

 

然而,实践证明JSF服务确实延迟了2分钟才上线(如图4所示),且此时CPU已经处于平稳状态,但是JSF上线瞬间又引起了CPU的二次飙升,同时调用方仍然会出现服务调用超时的现象。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-beixIETX-1681440619085)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/e9c0ce9c746f4781b2c7d22190d21d02~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=eTwjBs4fX2DFiBFJ1kqCgjoNIds%3D)]

图3 机器部署过程CPU短暂飙升

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-tVNMzJSg-1681440619086)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/e8b173dff58f4f538e751afd935e2bf1~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=A9QryRBUJ7wThqIFQK86444dZNY%3D)]

图4 部署和JSF上线瞬间均导致CPU飙升

线索二:JSF上线瞬间JVM线程数飙升(如图5所示)

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-kPMWYIab-1681440619088)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/c6b188c619f24be6affd0721ceab3455~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=K6aGT4Upbj6dVRm6LxPrmsJTQMw%3D)]

图5 JSF上线瞬间线程数飙升

使用jstack命令工具查看线程堆栈,可以发现数量增长最多的线程是JSF-BZ线程,且都处于阻塞等待状态:

"JSF-BZ-22000-137-T-350" #1038 daemon prio=5 os_prio=0 tid=0x00007f02bcde9000 nid=0x6fff waiting on condition [0x00007efa10284000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000640b359e8> (a java.util.concurrent.SynchronousQueue$TransferStack)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
	at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:924)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- None

"JSF-BZ-22000-137-T-349" #1037 daemon prio=5 os_prio=0 tid=0x00007f02bcde7000 nid=0x6ffe waiting on condition [0x00007efa10305000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000640b359e8> (a java.util.concurrent.SynchronousQueue$TransferStack)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
	at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:924)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- None

"JSF-BZ-22000-137-T-348" #1036 daemon prio=5 os_prio=0 tid=0x00007f02bcdd8000 nid=0x6ffd waiting on condition [0x00007efa10386000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000640b359e8> (a java.util.concurrent.SynchronousQueue$TransferStack)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
	at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
	at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:924)
	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- None

...


通过关键字“JSF-BZ”可以在JSF源码中检索,可以找到关于“JSF-BZ”线程池初始化源码如下:

private static synchronized ThreadPoolExecutor initPool(ServerTransportConfig transportConfig) {
    final int minPoolSize, aliveTime, port = transportConfig.getPort();

    int maxPoolSize = transportConfig.getServerBusinessPoolSize();
    String poolType = transportConfig.getServerBusinessPoolType();
    if ("fixed".equals(poolType)) { minPoolSize = maxPoolSize;
    aliveTime = 0;
    } else if ("cached".equals(poolType)) { minPoolSize = 20;
    maxPoolSize = Math.max(minPoolSize, maxPoolSize);
    aliveTime = 60000;
    } else { throw new IllegalConfigureException(21401, "server.threadpool", poolType);
    }

    String queueType = transportConfig.getPoolQueueType();
    int queueSize = transportConfig.getPoolQueueSize();
    boolean isPriority = "priority".equals(queueType);
    BlockingQueue configQueue = ThreadPoolUtils.buildQueue(queueSize, isPriority);

    NamedThreadFactory threadFactory = new NamedThreadFactory("JSF-BZ-" + port, true);
    RejectedExecutionHandler handler = new RejectedExecutionHandler() {
        private int i = 1;

        public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) { if (this.i++ % 7 == 0) {
            this.i = 1;
            BusinessPool.LOGGER.warn("[JSF-23002]Task:{} has been reject for ThreadPool exhausted! pool:{}, active:{}, queue:{}, taskcnt: {}", new Object[] { r, Integer.valueOf(executor.getPoolSize()), Integer.valueOf(executor.getActiveCount()), Integer.valueOf(executor.getQueue().size()), Long.valueOf(executor.getTaskCount()) });
        }

        RejectedExecutionException err = new RejectedExecutionException("[JSF-23003]Biz thread pool of provider has bean exhausted, the server port is " + port);

        ProviderErrorHook.getErrorHookInstance().onProcess(new ProviderErrorEvent(err));
        throw err;
        }
    };
    LOGGER.debug("Build " + poolType + " business pool for port " + port + " [min: " + minPoolSize + " max:" + maxPoolSize + " queueType:" + queueType + " queueSize:" + queueSize + " aliveTime:" + aliveTime + "]");

    return new ThreadPoolExecutor(minPoolSize, maxPoolSize, aliveTime, TimeUnit.MILLISECONDS, configQueue, (ThreadFactory)threadFactory, handler);
}

public static BlockingQueue buildQueue(int size, boolean isPriority) {
    BlockingQueue queue;
    if (size == 0) {
      queue = new SynchronousQueue();
    }
    else if (isPriority) {
      queue = (size < 0) ? new PriorityBlockingQueue() : new PriorityBlockingQueue(size);
    } else {
      queue = (size < 0) ? new LinkedBlockingQueue() : new LinkedBlockingQueue(size);
    } 
    
    return queue;
  }


另外,JSF官方文档关于线程池的说明如下:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-XaFoZiOb-1681440619091)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/2197cca8b23e4438ad8cec39a8c859f7~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=yI6Go8npbT%2BuT4WJpxdUwcLk6VI%3D)]

结合JSF源码以及JSF官方文档说明,可以知道JSF-BZ线程池的阻塞队列用的是SynchronousQueue,这是一个同步阻塞队列,其中每个put必须等待一个take,反之亦然。JSF-BZ线程池默认使用的是伸缩无队列线程池,初始线程数为20个,那么在JSF上线的瞬间,大批量并发请求进入,初始化线程远不够用,因此新建了大量线程。

既然知道了是由于JSF线程池初始化线程数量不足导致的,那么我们可以考虑在应用启动时对JSF线程池进行预热,也就是说在应用启动时创建足够数量的线程备用。通过查阅JSF源码,我们找到了如下方式实现JSF线程池的预热:

// 从Spring上下文获取JSF ServerBean,可能有多个
Map serverBeanMap = applicationContext.getBeansOfType(ServerBean.class);
if (CollectionUtils.isEmpty(serverBeanMap)) {
    log.error("application preheat, jsf thread pool preheat failed, serverBeanMap is empty.");
    return;
}

// 遍历所有serverBean,分别做预热处理
serverBeanMap.forEach((serverBeanName, serverBean) -> {
    if (Objects.isNull(serverBean)) {
        log.error("application preheat, jsf thread pool preheat failed, serverBean is null, serverBeanName:{}", serverBeanName);
        return;
    }
    // 启动ServerBean,启动后才可以获取到Server
    serverBean.start();
    Server server = serverBean.getServer();
    if (Objects.isNull(server)) {
        log.error("application preheat, jsf thread pool preheat failed, JSF Server is null, serverBeanName:{}", serverBeanName);
        return;
    }

    ServerTransportConfig serverTransportConfig = server.getTransportConfig();
    if (Objects.isNull(serverTransportConfig)) {
        log.error("application preheat, jsf thread pool preheat failed, serverTransportConfig is null, serverBeanName:{}", serverBeanName);
        return;
    }
    // 获取JSF业务线程池
    ThreadPoolExecutor businessPool = BusinessPool.getBusinessPool(serverTransportConfig);
    if (Objects.isNull(businessPool)) {
        log.error("application preheat, jsf biz pool preheat failed, businessPool is null, serverBeanName:{}", serverBeanName);
        return;
    }

    int corePoolSize = businessPool.getCorePoolSize();
    int maxCorePoolSize = Math.max(corePoolSize, 500);

    if (maxCorePoolSize > corePoolSize) {
        // 设置JSF server核心线程数
        businessPool.setCorePoolSize(maxCorePoolSize);
    }
    // 初始化JSF业务线程池所有核心线程
    if (businessPool.getPoolSize() < maxCorePoolSize) {
        businessPool.prestartAllCoreThreads();
    }
}


线索三:JSF-BZ线程池预热完成后,JSF上线瞬间JVM线程数仍有升高

继续使用jstack命令工具查看线程堆栈,对比后可以发现数量有增长的线程是JSF-SEV-WORKER线程:

"JSF-SEV-WORKER-139-T-129" #1295 daemon prio=5 os_prio=0 tid=0x00007ef66000b800 nid=0x7289 runnable [0x00007ef627cf8000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x0000000644f558b8> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x0000000641eaaca0> (a java.util.Collections$UnmodifiableSet)
	- locked <0x0000000641eaab88> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:805)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- None

"JSF-SEV-WORKER-139-T-128" #1293 daemon prio=5 os_prio=0 tid=0x00007ef60c002800 nid=0x7288 runnable [0x00007ef627b74000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x0000000641ea7450> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x0000000641e971e8> (a java.util.Collections$UnmodifiableSet)
	- locked <0x0000000641e970d0> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:805)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- None

"JSF-SEV-WORKER-139-T-127" #1291 daemon prio=5 os_prio=0 tid=0x00007ef608001000 nid=0x7286 runnable [0x00007ef627df9000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x0000000641e93998> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x0000000641e83730> (a java.util.Collections$UnmodifiableSet)
	- locked <0x0000000641e83618> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:805)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- None


那么JSF-SEV-WORKER线程是做什么的?我们是不是也可以对它做预热操作?带着这些疑问,再次查阅JSF源码:

private synchronized EventLoopGroup initChildEventLoopGroup() {
     NioEventLoopGroup nioEventLoopGroup = null;
     int threads = (this.childNioEventThreads > 0) ? this.childNioEventThreads : Math.max(8, Constants.DEFAULT_IO_THREADS);
 
     NamedThreadFactory threadName = new NamedThreadFactory("JSF-SEV-WORKER", isDaemon());
     EventLoopGroup eventLoopGroup = null;
     if (isUseEpoll()) {
       EpollEventLoopGroup epollEventLoopGroup = new EpollEventLoopGroup(threads, (ThreadFactory)threadName);
     } else {
       nioEventLoopGroup = new NioEventLoopGroup(threads, (ThreadFactory)threadName);
     } 
     return (EventLoopGroup)nioEventLoopGroup;
}


从JSF源码中可以看出JSF-SEV-WORKER线程是JSF内部使用Netty处理网络通信创建的线程,仔细研读JSF源码同样可以找到预热JSF-SEV-WORKER线程的方法,代码如下:

// 通过serverTransportConfig获取NioEventLoopGroup
// 其中,serverTransportConfig的获取方式可参考JSF-BZ线程预热代码
NioEventLoopGroup eventLoopGroup = (NioEventLoopGroup) serverTransportConfig.getChildEventLoopGroup();

int threadSize = this.jsfSevWorkerThreads;
while (threadSize-- > 0) {
    new Thread(() -> {
        // 通过手工提交任务的方式创建JSF-SEV-WORKER线程达到预热效果
        eventLoopGroup.submit(() -> log.info("submit thread to netty by hand, threadName:{}", Thread.currentThread().getName()));
    }).start();
}


JSF-BZ线程、JSF-SEV-WORKER线程预热效果如下图所示:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-mbyBFVir-1681440619094)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/1a82339439d941e38c78a7f2411e0def~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=c%2B%2F7RrzthbDd6tY5Hpq997S12YI%3D)]

图6 JSF-BZ/JSF-SEV-WORKER线程预热效果

挖掘源码线索

至此,经过JSF延迟发布、JSF内部线程池预热后,系统部署引起服务调用方抖动超时的现象有一定缓解(从原来的10000ms-20000ms降低到5000ms-10000ms),虽然说是有效果,但还有些不尽如人意。应该还是有优化空间的,现在是时候考虑我们最开始留下的那个疑问了:“服务调用方在进入服务提供方内部处理逻辑之前(或者之后),具体都经历了什么?”。最容易想到的肯定是中间经过了网络,但是网络因素基本可以排除,因为在部署过程中机器网络性能正常,那么还有哪些影响因素呢?此时我们还是要回归到JSF源码中去寻找线索。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-1xp623IT-1681440619096)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/6d0ae10b6f4d48a2b3192251c59e9fc3~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=OSA%2FHpJbaUeluT0Bfk31CoKVE%2BE%3D)]

图7 JSF源码中Provider内部处理过程

经过仔细研读JSF源码,我们可以发现JSF内部对于接口出入参有一系列编码、解码、序列化、反序列化的操作,而且在这些操作中我们有了惊喜的发现:本地缓存,部分源码如下:

DESC_CLASS_CACHE

private static final ConcurrentMap> DESC_CLASS_CACHE = new ConcurrentHashMap>();

private static Class desc2class(ClassLoader cl, String desc) throws ClassNotFoundException {
  switch (desc.charAt(0)) {
    case 'V':
      return void.class;
    case 'Z': return boolean.class;
    case 'B': return byte.class;
    case 'C': return char.class;
    case 'D': return double.class;
    case 'F': return float.class;
    case 'I': return int.class;
    case 'J': return long.class;
    case 'S': return short.class;
    case 'L':
      desc = desc.substring(1, desc.length() - 1).replace('/', '.');
      break;
    case '[':
      desc = desc.replace('/', '.');
      break;
    default:
      throw new ClassNotFoundException("Class not found: " + desc);
  } 
  
  if (cl == null)
    cl = ClassLoaderUtils.getCurrentClassLoader(); 
  Class clazz = DESC_CLASS_CACHE.get(desc);
  if (clazz == null) {
    clazz = Class.forName(desc, true, cl);
    DESC_CLASS_CACHE.put(desc, clazz);
  } 
  return clazz;
}


NAME_CLASS_CACHE

private static final ConcurrentMap> NAME_CLASS_CACHE = new ConcurrentHashMap>();

private static Class name2class(ClassLoader cl, String name) throws ClassNotFoundException {
  int c = 0, index = name.indexOf('[');
  if (index > 0) {
    
    c = (name.length() - index) / 2;
    name = name.substring(0, index);
  } 
  if (c > 0) {
    
    StringBuilder sb = new StringBuilder();
    while (c-- > 0) {
      sb.append("[");
    }
    if ("void".equals(name)) { sb.append('V'); }
    else if ("boolean".equals(name)) { sb.append('Z'); }
    else if ("byte".equals(name)) { sb.append('B'); }
    else if ("char".equals(name)) { sb.append('C'); }
    else if ("double".equals(name)) { sb.append('D'); }
    else if ("float".equals(name)) { sb.append('F'); }
    else if ("int".equals(name)) { sb.append('I'); }
    else if ("long".equals(name)) { sb.append('J'); }
    else if ("short".equals(name)) { sb.append('S'); }
    else { sb.append('L').append(name).append(';'); }
     name = sb.toString();
  }
  else {
    
    if ("void".equals(name)) return void.class; 
    if ("boolean".equals(name)) return boolean.class; 
    if ("byte".equals(name)) return byte.class; 
    if ("char".equals(name)) return char.class; 
    if ("double".equals(name)) return double.class; 
    if ("float".equals(name)) return float.class; 
    if ("int".equals(name)) return int.class; 
    if ("long".equals(name)) return long.class; 
    if ("short".equals(name)) return short.class;
  
  } 
  if (cl == null)
    cl = ClassLoaderUtils.getCurrentClassLoader(); 
  Class clazz = NAME_CLASS_CACHE.get(name);
  if (clazz == null) {
    clazz = Class.forName(name, true, cl);
    NAME_CLASS_CACHE.put(name, clazz);
  } 
  return clazz;
}


SerializerCache

private ConcurrentHashMap _cachedSerializerMap;

public Serializer getSerializer(Class cl) throws HessianProtocolException {
  Serializer serializer = (Serializer)_staticSerializerMap.get(cl);
  if (serializer != null) {
    return serializer;
  }
  
  if (this._cachedSerializerMap != null) {
    serializer = (Serializer)this._cachedSerializerMap.get(cl);
    if (serializer != null) {
      return serializer;
    }
  } 
  
  int i = 0;
  for (; serializer == null && this._factories != null && i < this._factories.size(); 
    i++) {

    
    AbstractSerializerFactory factory = this._factories.get(i);
    
    serializer = factory.getSerializer(cl);
  } 
  
  if (serializer == null)
  {
    if (isZoneId(cl)) {
      ZoneIdSerializer zoneIdSerializer = ZoneIdSerializer.getInstance();
    } else if (isEnumSet(cl)) {
      serializer = EnumSetSerializer.getInstance();
    } else if (JavaSerializer.getWriteReplace(cl) != null) {
      serializer = new JavaSerializer(cl, this._loader);
    }
    else if (HessianRemoteObject.class.isAssignableFrom(cl)) {
      serializer = new RemoteSerializer();


    
    }
    else if (Map.class.isAssignableFrom(cl)) {
      if (this._mapSerializer == null) {
        this._mapSerializer = new MapSerializer();
      }
      serializer = this._mapSerializer;
    } else if (Collection.class.isAssignableFrom(cl)) {
      if (this._collectionSerializer == null) {
        this._collectionSerializer = new CollectionSerializer();
      }
      
      serializer = this._collectionSerializer;
    } else if (cl.isArray()) {
      serializer = new ArraySerializer();
    } else if (Throwable.class.isAssignableFrom(cl)) {
      serializer = new ThrowableSerializer(cl, getClassLoader());
    } else if (InputStream.class.isAssignableFrom(cl)) {
      serializer = new InputStreamSerializer();
    } else if (Iterator.class.isAssignableFrom(cl)) {
      serializer = IteratorSerializer.create();
    } else if (Enumeration.class.isAssignableFrom(cl)) {
      serializer = EnumerationSerializer.create();
    } else if (Calendar.class.isAssignableFrom(cl)) {
      serializer = CalendarSerializer.create();
    } else if (Locale.class.isAssignableFrom(cl)) {
      serializer = LocaleSerializer.create();
    } else if (Enum.class.isAssignableFrom(cl)) {
      serializer = new EnumSerializer(cl);
    } 
  }
  if (serializer == null) {
    serializer = getDefaultSerializer(cl);
  }
  
  if (this._cachedSerializerMap == null) {
    this._cachedSerializerMap = new ConcurrentHashMap(8);
  }
  
  this._cachedSerializerMap.put(cl, serializer);
  
  return serializer;
}


DeserializerCache

private ConcurrentHashMap _cachedDeserializerMap;

public Deserializer getDeserializer(Class cl) throws HessianProtocolException {
  Deserializer deserializer = (Deserializer)_staticDeserializerMap.get(cl);
  if (deserializer != null) {
    return deserializer;
  }
  if (this._cachedDeserializerMap != null) {
    deserializer = (Deserializer)this._cachedDeserializerMap.get(cl);
    if (deserializer != null) {
      return deserializer;
    }
  } 
  
  int i = 0;
  for (; deserializer == null && this._factories != null && i < this._factories.size(); 
    i++) {
    
    AbstractSerializerFactory factory = this._factories.get(i);
    
    deserializer = factory.getDeserializer(cl);
  } 
  
  if (deserializer == null)
    if (Collection.class.isAssignableFrom(cl)) {
      deserializer = new CollectionDeserializer(cl);
    }
    else if (Map.class.isAssignableFrom(cl)) {
      deserializer = new MapDeserializer(cl);
    }
    else if (cl.isInterface()) {
      deserializer = new ObjectDeserializer(cl);
    }
    else if (cl.isArray()) {
      deserializer = new ArrayDeserializer(cl.getComponentType());
    }
    else if (Enumeration.class.isAssignableFrom(cl)) {
      deserializer = EnumerationDeserializer.create();
    }
    else if (Enum.class.isAssignableFrom(cl)) {
      deserializer = new EnumDeserializer(cl);
    }
    else if (Class.class.equals(cl)) {
      deserializer = new ClassDeserializer(this._loader);
    } else {
      
      deserializer = getDefaultDeserializer(cl);
    }  
  if (this._cachedDeserializerMap == null) {
    this._cachedDeserializerMap = new ConcurrentHashMap(8);
  }
  this._cachedDeserializerMap.put(cl, deserializer);
  
  return deserializer;
}


如上述源码所示,我们找到了四个本地缓存,遗憾的是,这四个本地缓存都是私有的,我们并不能直接对其进行初始化。但是我们还是从源码中找到了可以间接对这四个本地缓存进行初始化预热的方法,代码如下:

DESC_CLASS_CACHE、NAME_CLASS_CACHE预热代码

// DESC_CLASS_CACHE预热
ReflectUtils.desc2classArray(ReflectUtils.getDesc(Class.forName("cn.jdl.oms.express.model.CreateExpressOrderRequest")));
// NAME_CLASS_CACHE预热
ReflectUtils.name2class("cn.jdl.oms.express.model.CreateExpressOrderRequest");



SerializerCache、DeserializerCache预热代码

public class JsfSerializerFactoryPreheat extends HessianSerializerFactory {

    public static void doPreheat(String className) {
        try {
            // 序列化
            JsfSerializerFactoryPreheat.SERIALIZER_FACTORY.getSerializer(Class.forName("cn.jdl.oms.express.model.CreateExpressOrderRequest"));
            // 反序列化
            JsfSerializerFactoryPreheat.SERIALIZER_FACTORY.getDeserializer(Class.forName(className));
        } catch (Exception e) {
            // do nothing
            log.error("JsfSerializerFactoryPreheat failed:", e);
        }
    }
}


由JSF源码对于接口出入参编码、解码、序列化、反序列化操作,我们又想到应用接口内部有对出入参进行Fastjson序列化的操作,而且Fastjson序列化时需要初始化SerializeConfig,对性能会有一定影响(可参考
https://www.ktanx.com/blog/p/3181)。我们可以通过以下代码对Fastjson进行初始化预热:

JSON.parseObject(JSON.toJSONString(Class.forName("cn.jdl.oms.express.model.CreateExpressOrderRequest").newInstance()), Class.forName("cn.jdl.oms.express.model.CreateExpressOrderRequest"));


到目前为止,我们针对应用启动预热做了以下工作:

•JSF延迟发布

•JSF-BZ线程池预热

•JSF-SEV-WORKER线程预热

•JSF编码、解码、序列化、反序列化缓存预热

•Fastjson初始化预热

经过以上预热操作,应用部署引起服务抖动的现象得到了明显改善,由治理前的10000ms-20000ms降低到了 2000ms-3000ms(略高于日常流量抖动幅度)。

解决方案

基于以上分析,将JSF线程池预热、本地缓存预热、Fastjson预热整合打包,提供了一个简单可用的预热小工具,Jar包已上传私服,如有意向请参考使用说明:应用启动预热工具使用说明。

应用部署导致服务抖动属于一个共性问题,针对此问题目前有如下可选方案:

1、JSF官方提供的预热方案
https://cf.jd.com/pages/viewpage.action?pageId=1132755015)

原理:利用JSF1.7.6的预热策略动态下发,通过服务器负载均衡能力,对于上线需要预热的接口进行流量权重调整,小流量试跑,达到预热目的。

优点:平台配置即可,接入成本低。

缺点:按权重预热,资源预热不充分;需要服务调用方JSF版本升级到1.7.6,对于上游调用方较多的情况下推动版本升级困难。

2、流量录制回放预热方案

原理:录制线上真实流量,然后通过压测的方式将流量回放到新部署机器达到预热目的。

优点:结合了行云部署编排,下线、部署、预热、上线,以压测的方式可以使得预热更加充分。

缺点:使用流程较繁琐;仅对读接口友好,写接口需要关注数据是否对线上有影响。

3、本文方案

原理:通过对服务提供方JSF线程池、本地缓存、Fastjson进行初始化的方式进行系统预热。

优点:资源预热充分;使用简单,支持自定义扩展。

缺点:对除JSF以外的其他中间件如Redis、ES等暂不支持,但可以通过自定义扩展实现。

预热效果

预热前:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-NcZ234LI-1681440619097)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/4285ddcc52534432a54f55b64b7ee3c0~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=ZJmKKkmIBNEZlYcqp%2FrbwzEk9PY%3D)]

预热后:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-74uc5GHw-1681440619098)(https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/ff2dc94616a24667a1cb5f5e7a753bb4~noop.image?_iz=58558&from=article.pc_detail&x-expires=1682042388&x-signature=d3zBIESA%2FKqVztdIi5BZB8N0tGo%3D)]

使用本文提供的预热工具,预热前后对比效果明显,如上图所示,调用方方法性能MAX值从原来的10000ms-20000ms降低到了2000ms-3000ms,已经基本接近日常MAX抖点。

总结

应用部署引起上游服务抖动是一个常见问题,如果上游系统对服务抖动比较敏感,或会因此造成业务影响的话,这个问题还是需要引起我们足够的重视与关注。本文涉及的百川分流系统,单纯对外提供JSF服务,且无其他中间件的引入,特点是接口多,调用量大。

此问题在系统运行前期并不明显,上线部署上游基本无感,但随着调用量的增长,问题才逐渐凸显出来,如果单纯通过扩容也是可以缓解这个问题,但是这样会带来很大的资源浪费,违背“降本”的原则。为此,从已有线索出发,逐步深挖JSF源码,对线程池、本地缓存等在系统启动时进行充分初始化预热操作,从而有效降低JSF上线瞬间的服务抖动。

你可能感兴趣的:(技术分享,上手实操,运维,大数据,java,后端)