问题:
2023-01-16, 21:33:22 2023-01-16 21:33:22.661 ERROR 7 --- [167930-thread-1] c.a.druid.pool.DruidPooledStatement : CommunicationsException, druid version 1.2.5, jdbcUrl : jdbc:mysql://bds.haier.net:9035/di_fw?connectTimeout=300000&socketTimeout=300000, testWhileIdle true, idle millis 19862, minIdle 10, poolingCount 1, timeBetweenEvictionRunsMillis 60000, lastValidIdleMillis 49499, driver com.mysql.jdbc.Driver, exceptionSorter com.alibaba.druid.pool.vendor.MySqlExceptionSorter
2023-01-16, 21:33:22 2023-01-16 21:33:22.661 ERROR 7 --- [167930-thread-1] com.alibaba.druid.pool.DruidDataSource : discard connection
2023-01-16, 21:33:22
2023-01-16, 21:33:22 com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
2023-01-16, 21:33:22
2023-01-16, 21:33:22 The last packet successfully received from the server was 49,500 milliseconds ago. The last packet sent successfully to the server was 49,515 milliseconds ago. at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953) at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery$original$ICdb496O(ClientPreparedStatement.java:1003) at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery$original$ICdb496O$accessor$Y4IPLmuo(ClientPreparedStatement.java) at com.mysql.cj.jdbc.ClientPreparedStatement$auxiliary$DOhLhsKX.call(Unknown Source) at org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstMethodsInter.intercept(InstMethodsInter.java:86) at com.mysql.cj.jdbc.ClientPreparedStatement.executeQuery(ClientPreparedStatement.java) at com.alibaba.druid.pool.DruidPooledPreparedStatement.executeQuery(DruidPooledPreparedStatement.java:227) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
2023-01-16, 21:33:22
2023-01-16, 21:33:22 The last packet successfully received from the server was 49,500 milliseconds ago. The last packet sent successfully to the server was 49,515 milliseconds ago. at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:61) at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:105) at com.mysql.cj.exceptions.ExceptionFactory.createException(ExceptionFactory.java:151) at com.mysql.cj.exceptions.ExceptionFactory.createCommunicationsException(ExceptionFactory.java:167) at com.mysql.cj.protocol.a.NativeProtocol.clearInputStream(NativeProtocol.java:815) at com.mysql.cj.protocol.a.NativeProtocol.sendCommand(NativeProtocol.java:630) at com.mysql.cj.protocol.a.NativeProtocol.sendQueryPacket(NativeProtocol.java:948) at com.mysql.cj.NativeSession.execSQL(NativeSession.java:1075) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:930) ... 10 common frames omitted Caused by: java.io.IOException: Socket is closed at com.mysql.cj.protocol.AbstractSocketConnection.getMysqlInput(AbstractSocketConnection.java:72) at com.mysql.cj.protocol.a.NativeProtocol.clearInputStream(NativeProtocol.java:811) ... 14 common frames omitted
com.alibaba.druid.pool.DruidDataSource : discard connection
druid config:
testWhileIdle = true testOnBorrow = false timeBetweenEvictionRunsMillis = 60s minEvictableIdleTimeMillis = 5 min maxEvictableIdleTimeMillis = 50min
感觉 get connection出啥问题了:
public DruidPooledConnection getConnectionDirect(long maxWaitMillis) throws SQLException {
if (testOnBorrow) {...}
else {
if (testWhileIdle) {
...
if (idleMillis >= timeBetweenEvictionRunsMillis
|| idleMillis < 0 // unexcepted branch
) {
boolean validate = testConnectionInternal(poolableConnection.holder, poolableConnection.conn);
if (!validate) {
if (LOG.isDebugEnabled()) {
LOG.debug("skip not validate connection.");
}
discardConnection(poolableConnection.holder);
continue;
}
}
...
}
}
}
如果最近active的时间大于的判断周期(也就是60s),或没有active过,就要检查下连接,如果连接有问题,比如已经close了,validate就是false,就要discardConnection,discard的逻辑也很简单,就是close connection,然后更新池的状态;这里报错应该就是close的报错,为什么会报错的呢,猜测服务端已经关闭了连接。但是druid不是有清理线程的吗,为什么没有把已经close的connection的关掉呢?
继续看 清理线程逻辑:
启动清理线程
protected void createAndStartDestroyThread() {
destroyTask = new DestroyTask();
if (destroyScheduler != null) {
long period = timeBetweenEvictionRunsMillis;
...
destroySchedulerFuture = destroyScheduler.scheduleAtFixedRate(destroyTask, period, period,
TimeUnit.MILLISECONDS);
...
}
...
}
可以看出来,timeBetweenEvictionRunsMillis参数确定了清理线程的执行频率
public class DestroyTask implements Runnable {
public DestroyTask() {
}
@Override
public void run() {
shrink(true, keepAlive);
if (isRemoveAbandoned()) {
removeAbandoned();
}
}
}
可以看到,逻辑都在shrink里面。
这个方法主要干两件事情:
那什么情况下的连接会被认为是要清理的呢?
if (phyTimeoutMillis > 0) {
long phyConnectTimeMillis = currentTimeMillis - connection.connectTimeMillis;
if (phyConnectTimeMillis > phyTimeoutMillis) {
evictConnections[evictCount++] = connection;
continue;
}
}
2. 空闲时间 > min idel (1. > min idel count 2. 空闲时间>max idel)
if (idleMillis >= minEvictableIdleTimeMillis) {
if (checkTime && i < checkCount) {
evictConnections[evictCount++] = connection;
continue;
} else if (idleMillis > maxEvictableIdleTimeMillis) {
evictConnections[evictCount++] = connection;
continue;
}
}
3. 该场景 keep alive设置的false,不会触发这个逻辑
if (keepAlive && idleMillis >= keepAliveBetweenTimeMillis) {
keepAliveConnections[keepAliveCount++] = connection;
}
什么情况下,会认为连接不需要处理呢?
当目前空闲的时间小于最小空闲保持时间(5min),并且小于keepAliveBetweenTimeMillis(该参数默认2分钟)
long idleMillis = currentTimeMillis - connection.lastActiveTimeMillis;
if (idleMillis < minEvictableIdleTimeMillis
&& idleMillis < keepAliveBetweenTimeMillis
) {
break;
}
也就是说,两分钟内如果连接已经被关闭,或者已经失效,但这个连接不会被加入到清理的队列。但是正常数据库的超时时间是五分钟,是不会触发这个问题的,why?
后来从运维doris的同学口中得知,报错的那个时间点,正好重启了doris,那么整个原因明确了:因为重启了server端,所有的连接其实都close了,但是因为上述druid的逻辑,两分钟内连接没有加入清理队列,所有test idle的时候close就报错了。
服务告警从21:31持续到21:34后恢复正常,整个过程持续了3分钟左右,也从侧面证实了推断的正确
有什么办法能避免吗?
其实服务器都已经挂了,池中的连接已经都是无效的,唯一能做的就是加快清理连接的速度,但这样的调整,会拖慢平时的连接获取速度,所以其实啥都做不了.....