记录一次springboot自动任务线上突然不执行问题排查过程

      其实问题排查过程不是很复杂,只是我们项目要远程,并且服务器是win,连jdk自带的一些工具都没有,arthas在windowns上兼容性也不好,浪费了很多时间。

1、简述

       首先,springboot的自动任务注解@EnableScheduling和@Scheduled底层原理是默认是单线程的,你也可以配置用自定义的线程池。所以,线上有自动任务突然不执行,肯定是有自动任务线程发生阻塞或者死锁,导致卡住了,其他自动任务都无法执行,对外呈现一个程序假死的情况。

       首先简单说下多线程的配置方式,不是本文重点,简单记录下:

import org.springframework.boot.autoconfigure.batch.BatchProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;
import java.lang.reflect.Method;
import java.util.concurrent.Executors;
@Configuration
public class ScheduleConfig implements SchedulingConfigurer {
  @Override
  public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
    Method[] methods = BatchProperties.Job.class.getMethods();
    int defaultPoolSize = 3;
    int corePoolSize = 0;
    if (methods != null && methods.length > 0) {
      for (Method method : methods) {
        Scheduled annotation = method.getAnnotation(Scheduled.class);
        if (annotation != null) {
          corePoolSize++;
        }
      }
      if (defaultPoolSize > corePoolSize)
        corePoolSize = defaultPoolSize;
    }
    taskRegistrar.setScheduler(Executors.newScheduledThreadPool(corePoolSize));
  }
}

           由于这个问题重启后就很难复现了,所以我没有立即重启,而是想办法通过jps jmap jstack这些好用的工具查看线程堆栈信息,看是不是发生死锁或者其他问题;可是坑爹的是,我们这个项目只有jre运行环境,所以我把相同版本的jdk下的jps jmap jhat jstack这些exe文件直接拷贝到jre的bin目录下,发现可以执行了。

2、jstack查看线程

首先查看下tomcat的进程号,然后

jstack pid

"scheduling-1" #287 prio=5 os_prio=0 tid=0x00000000259c3800 nid=0x138c8 runnable [0x000000004749d000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at oracle.net.ns.Packet.receive(Packet.java:300)
	at oracle.net.ns.DataPacket.receive(DataPacket.java:106)
	at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:315)
	at oracle.net.ns.NetInputStream.read(NetInputStream.java:260)
	at oracle.net.ns.NetInputStream.read(NetInputStream.java:185)
	at oracle.net.ns.NetInputStream.read(NetInputStream.java:102)
	at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:124)
	at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:80)
	at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1137)
	at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:290)
	at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
	at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:380)
	at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:760)
	at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:401)
	at oracle.jdbc.driver.PhysicalConnection.(PhysicalConnection.java:546)
	at oracle.jdbc.driver.T4CConnection.(T4CConnection.java:236)
	at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
	at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
	at java.sql.DriverManager.getConnection(DriverManager.java:664)
	at java.sql.DriverManager.getConnection(DriverManager.java:247)
	at com.hikvision.ctm05cqfskdj.module.service.impl.SyncPersonInfoServiceImp.getNewConnection(SyncPersonInfoServiceImp.java:189)
	at com.hikvision.ctm05cqfskdj.module.service.impl.SyncPersonInfoServiceImp.syncPerson(SyncPersonInfoServiceImp.java:86)
	at com.hikvision.ctm05cqfskdj.module.task.AutoSyncPersonToIscTask.autoSyncPersonTask(AutoSyncPersonToIscTask.java:37)
	- locked <0x000000072a8cca68> (a com.hikvision.ctm05cqfskdj.module.task.AutoSyncPersonToIscTask$$EnhancerBySpringCGLIB$$b642cbea)
	at sun.reflect.GeneratedMethodAccessor3553.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:752)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
	at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:88)
	at org.springframework.cloud.sleuth.instrument.scheduling.TraceSchedulingAspect.traceBackgroundThread(TraceSchedulingAspect.java:70)
	at sun.reflect.GeneratedMethodAccessor2405.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:644)
	at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:633)
	at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
	at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:93)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
	at com.hikvision.ctm05cqfskdj.module.task.AutoSyncPersonToIscTask$$EnhancerBySpringCGLIB$$ad984c90.autoSyncPersonTask()
	at sun.reflect.GeneratedMethodAccessor3553.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
	at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- <0x0000000721613c10> (a java.util.concurrent.ThreadPoolExecutor$Worker)

关于堆栈查看,很简单,关注以下几个点,另外,这里给出一个堆栈分析网站,可视化的,很牛逼:https://fastthread.io/
 

记录一次springboot自动任务线上突然不执行问题排查过程_第1张图片

记录一次springboot自动任务线上突然不执行问题排查过程_第2张图片

      堆栈信息很多,忽略掉其他的,根据线程名称查scheduling-1,也没发现发现死锁,scheduling-1线程阻塞在了java.net.SocketInputStream.socketRead0(Native Method)这个本地方法,并且状态是RUNABLE,初步判断是这个地方阻塞,查看堆栈信息,发现代码行:

at com.hikvision.ctm05cqfskdj.module.service.impl.SyncPersonInfoServiceImp.getNewConnection(SyncPersonInfoServiceImp.java:189)
/**
     * 每次人员同步都去清空连接池中的连接(不管是否有效),拿最新连接去执行自动任务
     * */
    public void getNewConnection(){
        log.info("开始获取最新连接,放入连接池");
        poolClear();
        LinkedList pool = JdbcHelper.getInstance().getPool();
        try {
            log.info("解密前username={},password={}",username,password);
            String decryptUserName = ConfigEncryptUtil.decrypt(username);
            String decryptPassword = ConfigEncryptUtil.decrypt(password);
            for(int i = 0; i < JDBC_POOL_SIZE; i++) {
                Connection conn = DriverManager.getConnection(url, decryptUserName, decryptPassword);
                if (null != conn && pool.size() < JDBC_POOL_SIZE){
                    pool.push(conn);
                }
            }
        } catch (Exception e) {
            log.error("获取连接异常.e={},size={}",e.getMessage(),pool.size());
        }
        log.info("获取连接结束。size={}",pool.size());
    }


通过日志发现,"开始获取最新连接,放入连接池"这行日志没有与之对应的“获取连接结束”,说明这个方法阻塞了。再进一步定位,很容易就发现:

Connection conn = DriverManager.getConnection(url, decryptUserName, decryptPassword);

这行代码可能有问题,点进去看:

private static volatile int loginTimeout = 0;

/**
     * Sets the maximum time in seconds that a driver will wait
     * while attempting to connect to a database once the driver has
     * been identified.
     *
     * @param seconds the login time limit in seconds; zero means there is no limit
     * @see #getLoginTimeout
     */
    public static void setLoginTimeout(int seconds) {
        loginTimeout = seconds;
    }

注意这个注释:登录时间限制以秒为单位;零表示没有限制 ,而默认值刚好是0;所以基本定位ok!

3、查看GC情况

gc情况正常,CMS垃圾回收算是比较高效的并发收集器,不过最新的ZGC是未来的趋势,或许以后都不需要gc调优了!

Attaching to process ID 61008, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.181-b13

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 3221225472 (3072.0MB)
   NewSize                  = 536870912 (512.0MB)
   MaxNewSize               = 536870912 (512.0MB)
   OldSize                  = 536870912 (512.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 483196928 (460.8125MB)
   used     = 423850536 (404.21537017822266MB)
   free     = 59346392 (56.597129821777344MB)
   87.71796992881544% used
Eden Space:
   capacity = 429522944 (409.625MB)
   used     = 417381920 (398.0464172363281MB)
   free     = 12141024 (11.578582763671875MB)
   97.17337009126106% used
From Space:
   capacity = 53673984 (51.1875MB)
   used     = 6468616 (6.168952941894531MB)
   free     = 47205368 (45.01854705810547MB)
   12.051678518963675% used
To Space:
   capacity = 53673984 (51.1875MB)
   used     = 0 (0.0MB)
   free     = 53673984 (51.1875MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 1133961216 (1081.4296875MB)
   used     = 372137864 (354.89832305908203MB)
   free     = 761823352 (726.531364440918MB)
   32.817512517112405% used

90956 interned Strings occupying 9060464 bytes.

4、原因:socketRead0阻塞

       现在的Java EE 系统通常都依赖于远程服务,需要通过网络请求获取远程服务。不管是使用什么协议(HTTP,HTTPS,JDBC 等)请求远程服务,Java VM最终都将委托给Socket去实现。首先会通过connect方法与远程服务建立连接,连接建立成功后就可以 write/read 与远程服务实现数据通信。Socket.connect()、Socket.write()、Socket.read() 方法调用都被称为阻塞IO调用。这些调用由于网络延时、服务器响应缓慢等问题往往会导致线程挂起。

5、何时出现阻塞

通过日志发现,阻塞的时机发生在连接oracle的时候如果报错:

ORA-00257: 存檔器錯誤. 連線僅限內部, 直到釋出為止

出现这个错误,才发现了自动任务不执行的情况,表明这个错误会导致获取连接被阻塞;这个错误查阅资料发现是:

记录一次springboot自动任务线上突然不执行问题排查过程_第3张图片

 

5、解决方案

//单位s:设置10分钟
DriverManager.setLoginTimeout(10 * 60);
Connection conn = DriverManager.getConnection(url, decryptUserName, decryptPassword);

6、验证是否生效

只能改了之后,看后面是否还出现自动任务不执行的情况了。一个月后评论给出验证结果!

你可能感兴趣的:(java核心知识)