其实问题排查过程不是很复杂,只是我们项目要远程,并且服务器是win,连jdk自带的一些工具都没有,arthas在windowns上兼容性也不好,浪费了很多时间。
首先,springboot的自动任务注解@EnableScheduling和@Scheduled底层原理是默认是单线程的,你也可以配置用自定义的线程池。所以,线上有自动任务突然不执行,肯定是有自动任务线程发生阻塞或者死锁,导致卡住了,其他自动任务都无法执行,对外呈现一个程序假死的情况。
首先简单说下多线程的配置方式,不是本文重点,简单记录下:
import org.springframework.boot.autoconfigure.batch.BatchProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.scheduling.annotation.SchedulingConfigurer;
import org.springframework.scheduling.config.ScheduledTaskRegistrar;
import java.lang.reflect.Method;
import java.util.concurrent.Executors;
@Configuration
public class ScheduleConfig implements SchedulingConfigurer {
@Override
public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
Method[] methods = BatchProperties.Job.class.getMethods();
int defaultPoolSize = 3;
int corePoolSize = 0;
if (methods != null && methods.length > 0) {
for (Method method : methods) {
Scheduled annotation = method.getAnnotation(Scheduled.class);
if (annotation != null) {
corePoolSize++;
}
}
if (defaultPoolSize > corePoolSize)
corePoolSize = defaultPoolSize;
}
taskRegistrar.setScheduler(Executors.newScheduledThreadPool(corePoolSize));
}
}
由于这个问题重启后就很难复现了,所以我没有立即重启,而是想办法通过jps jmap jstack这些好用的工具查看线程堆栈信息,看是不是发生死锁或者其他问题;可是坑爹的是,我们这个项目只有jre运行环境,所以我把相同版本的jdk下的jps jmap jhat jstack这些exe文件直接拷贝到jre的bin目录下,发现可以执行了。
首先查看下tomcat的进程号,然后
jstack pid
"scheduling-1" #287 prio=5 os_prio=0 tid=0x00000000259c3800 nid=0x138c8 runnable [0x000000004749d000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at oracle.net.ns.Packet.receive(Packet.java:300)
at oracle.net.ns.DataPacket.receive(DataPacket.java:106)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:315)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:260)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:185)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:102)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:124)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:80)
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1137)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:290)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:192)
at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:380)
at oracle.jdbc.driver.T4CTTIoauthenticate.doOAUTH(T4CTTIoauthenticate.java:760)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:401)
at oracle.jdbc.driver.PhysicalConnection.(PhysicalConnection.java:546)
at oracle.jdbc.driver.T4CConnection.(T4CConnection.java:236)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at com.hikvision.ctm05cqfskdj.module.service.impl.SyncPersonInfoServiceImp.getNewConnection(SyncPersonInfoServiceImp.java:189)
at com.hikvision.ctm05cqfskdj.module.service.impl.SyncPersonInfoServiceImp.syncPerson(SyncPersonInfoServiceImp.java:86)
at com.hikvision.ctm05cqfskdj.module.task.AutoSyncPersonToIscTask.autoSyncPersonTask(AutoSyncPersonToIscTask.java:37)
- locked <0x000000072a8cca68> (a com.hikvision.ctm05cqfskdj.module.task.AutoSyncPersonToIscTask$$EnhancerBySpringCGLIB$$b642cbea)
at sun.reflect.GeneratedMethodAccessor3553.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343)
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:752)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:88)
at org.springframework.cloud.sleuth.instrument.scheduling.TraceSchedulingAspect.traceBackgroundThread(TraceSchedulingAspect.java:70)
at sun.reflect.GeneratedMethodAccessor2405.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:644)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:633)
at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:93)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
at com.hikvision.ctm05cqfskdj.module.task.AutoSyncPersonToIscTask$$EnhancerBySpringCGLIB$$ad984c90.autoSyncPersonTask()
at sun.reflect.GeneratedMethodAccessor3553.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x0000000721613c10> (a java.util.concurrent.ThreadPoolExecutor$Worker)
关于堆栈查看,很简单,关注以下几个点,另外,这里给出一个堆栈分析网站,可视化的,很牛逼:https://fastthread.io/
堆栈信息很多,忽略掉其他的,根据线程名称查scheduling-1,也没发现发现死锁,scheduling-1线程阻塞在了java.net.SocketInputStream.socketRead0(Native Method)这个本地方法,并且状态是RUNABLE,初步判断是这个地方阻塞,查看堆栈信息,发现代码行:
at com.hikvision.ctm05cqfskdj.module.service.impl.SyncPersonInfoServiceImp.getNewConnection(SyncPersonInfoServiceImp.java:189)
/**
* 每次人员同步都去清空连接池中的连接(不管是否有效),拿最新连接去执行自动任务
* */
public void getNewConnection(){
log.info("开始获取最新连接,放入连接池");
poolClear();
LinkedList pool = JdbcHelper.getInstance().getPool();
try {
log.info("解密前username={},password={}",username,password);
String decryptUserName = ConfigEncryptUtil.decrypt(username);
String decryptPassword = ConfigEncryptUtil.decrypt(password);
for(int i = 0; i < JDBC_POOL_SIZE; i++) {
Connection conn = DriverManager.getConnection(url, decryptUserName, decryptPassword);
if (null != conn && pool.size() < JDBC_POOL_SIZE){
pool.push(conn);
}
}
} catch (Exception e) {
log.error("获取连接异常.e={},size={}",e.getMessage(),pool.size());
}
log.info("获取连接结束。size={}",pool.size());
}
通过日志发现,"开始获取最新连接,放入连接池"这行日志没有与之对应的“获取连接结束”,说明这个方法阻塞了。再进一步定位,很容易就发现:
Connection conn = DriverManager.getConnection(url, decryptUserName, decryptPassword);
这行代码可能有问题,点进去看:
private static volatile int loginTimeout = 0;
/**
* Sets the maximum time in seconds that a driver will wait
* while attempting to connect to a database once the driver has
* been identified.
*
* @param seconds the login time limit in seconds; zero means there is no limit
* @see #getLoginTimeout
*/
public static void setLoginTimeout(int seconds) {
loginTimeout = seconds;
}
注意这个注释:登录时间限制以秒为单位;零表示没有限制 ,而默认值刚好是0;所以基本定位ok!
gc情况正常,CMS垃圾回收算是比较高效的并发收集器,不过最新的ZGC是未来的趋势,或许以后都不需要gc调优了!
Attaching to process ID 61008, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.181-b13
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 3221225472 (3072.0MB)
NewSize = 536870912 (512.0MB)
MaxNewSize = 536870912 (512.0MB)
OldSize = 536870912 (512.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 483196928 (460.8125MB)
used = 423850536 (404.21537017822266MB)
free = 59346392 (56.597129821777344MB)
87.71796992881544% used
Eden Space:
capacity = 429522944 (409.625MB)
used = 417381920 (398.0464172363281MB)
free = 12141024 (11.578582763671875MB)
97.17337009126106% used
From Space:
capacity = 53673984 (51.1875MB)
used = 6468616 (6.168952941894531MB)
free = 47205368 (45.01854705810547MB)
12.051678518963675% used
To Space:
capacity = 53673984 (51.1875MB)
used = 0 (0.0MB)
free = 53673984 (51.1875MB)
0.0% used
concurrent mark-sweep generation:
capacity = 1133961216 (1081.4296875MB)
used = 372137864 (354.89832305908203MB)
free = 761823352 (726.531364440918MB)
32.817512517112405% used
90956 interned Strings occupying 9060464 bytes.
现在的Java EE 系统通常都依赖于远程服务,需要通过网络请求获取远程服务。不管是使用什么协议(HTTP,HTTPS,JDBC 等)请求远程服务,Java VM最终都将委托给Socket去实现。首先会通过connect方法与远程服务建立连接,连接建立成功后就可以 write/read 与远程服务实现数据通信。Socket.connect()、Socket.write()、Socket.read() 方法调用都被称为阻塞IO调用。这些调用由于网络延时、服务器响应缓慢等问题往往会导致线程挂起。
通过日志发现,阻塞的时机发生在连接oracle的时候如果报错:
ORA-00257: 存檔器錯誤. 連線僅限內部, 直到釋出為止
出现这个错误,才发现了自动任务不执行的情况,表明这个错误会导致获取连接被阻塞;这个错误查阅资料发现是:
//单位s:设置10分钟
DriverManager.setLoginTimeout(10 * 60);
Connection conn = DriverManager.getConnection(url, decryptUserName, decryptPassword);
只能改了之后,看后面是否还出现自动任务不执行的情况了。一个月后评论给出验证结果!