一、背景
Mysql 的DBA给Mysql定义一套规则,mysql 服务器端的默认的超时时间wait_timeout为8小时,但DBA把wait_timeout改为600秒,我估计这规则本意是减少数据库的长时间链接的情况,只要链接空闲超过600秒,服务器端会自动断开链接。所有产生的影响必须由客户端程序来保障。
1.1、版本说明:
mysql:5.7.17
druid:1.1.5
mysql-connector-java:5.1.44
1.2、druid重要属性配置
详细配置请见官网文档:DruidDataSource配置属性列表
二、现象
系统上线一段时间后,在监控时而报错如下:
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 72,557 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.
根据错误日志初步判断肯定是 mysql服务端把链接已经断开,但客户端不知道并,依然尝试使用了一个已经断开的链接才会引起这个错误发生。但是根据我们对 druid 了解,druid 有链接检查功能,按理不会拿到一个无效链接才对。
三、分析
3.1、整体分析
图片表示了druid在获取线程池的大致的逻辑过程:druid在初始化时会创建两个守护线程,分别承担线程的创建(CreateConnectionThread)和销毁任务(DestoryConnectionThread),
当用户线程出现等待获取线程的操作时(且线程池中的线程数不大于最大活动线程数),创建线程会自动创建新的连接并放到线程池中,所以当用户线程需要新的连接时,只需要直接从线程池获取即可。
当用户线程从线程池中获取到连接会根据用户的配置决定是否线程进行有效性验证,如果验证线程有效则返回线程,如果无效则将该连接关闭,(DestoryConnectionThread自动回收已关闭的连接)
3.2、线程创建及销毁任务
程序启动在创建数据连接时,会自动创建两个任务(job),也就是CreateConnectionThread和DestoryConnectionThread
- CreateConnectionThread比较简单,也是个守候线程,代码如下:
public class CreateConnectionThread extends Thread {
public CreateConnectionThread(String name){
super(name);
this.setDaemon(true);
}
public void run() {
initedLatch.countDown();
long lastDiscardCount = 0;
int errorCount = 0;
for (;;) {
// addLast
try {
lock.lockInterruptibly();
} catch (InterruptedException e2) {
break;
}
long discardCount = DruidDataSource.this.discardCount;
boolean discardChanged = discardCount - lastDiscardCount > 0;
lastDiscardCount = discardCount;
try {
boolean emptyWait = true;
if (createError != null
&& poolingCount == 0
&& !discardChanged) {
emptyWait = false;
}
if (emptyWait
&& asyncInit && createCount < initialSize) {
emptyWait = false;
}
if (emptyWait) {
// 必须存在线程等待,才创建连接
if (poolingCount >= notEmptyWaitThreadCount //
&& (!(keepAlive && activeCount + poolingCount < minIdle))
&& !isFailContinuous()
) {
empty.await();
}
// 防止创建超过maxActive数量的连接
if (activeCount + poolingCount >= maxActive) {
empty.await();
continue;
}
}
} catch (InterruptedException e) {
lastCreateError = e;
lastErrorTimeMillis = System.currentTimeMillis();
if (!closing) {
LOG.error("create connection Thread Interrupted, url: " + jdbcUrl, e);
}
break;
} finally {
lock.unlock();
}
PhysicalConnectionInfo connection = null;
try {
connection = createPhysicalConnection();
} catch (SQLException e) {
LOG.error("create connection SQLException, url: " + jdbcUrl + ", errorCode " + e.getErrorCode()
+ ", state " + e.getSQLState(), e);
errorCount++;
if (errorCount > connectionErrorRetryAttempts && timeBetweenConnectErrorMillis > 0) {
// fail over retry attempts
setFailContinuous(true);
if (failFast) {
lock.lock();
try {
notEmpty.signalAll();
} finally {
lock.unlock();
}
}
if (breakAfterAcquireFailure) {
break;
}
try {
Thread.sleep(timeBetweenConnectErrorMillis);
} catch (InterruptedException interruptEx) {
break;
}
}
} catch (RuntimeException e) {
LOG.error("create connection RuntimeException", e);
setFailContinuous(true);
continue;
} catch (Error e) {
LOG.error("create connection Error", e);
setFailContinuous(true);
break;
}
if (connection == null) {
continue;
}
boolean result = put(connection);
if (!result) {
JdbcUtils.close(connection.getPhysicalConnection());
LOG.info("put physical connection to pool failed.");
}
errorCount = 0; // reset errorCount
}
}
}
说明:
- 死循环 for (;;)
- 必须存在线程等待,才创建连接。
- 创超时不能超过maxActive数量的连接。
- DestoryConnectionThread
创建了一个死循环的任务,每过timeBetweenEvictionRunsMillis执行一次。public class DestroyConnectionThread extends Thread { public DestroyConnectionThread(String name){ super(name); this.setDaemon(true); } public void run() { initedLatch.countDown(); for (;;) { // 从前面开始删除 try { if (closed) { break; } if (timeBetweenEvictionRunsMillis > 0) { Thread.sleep(timeBetweenEvictionRunsMillis); } else { Thread.sleep(1000); // } if (Thread.interrupted()) { break; } destroyTask.run(); } catch (InterruptedException e) { break; } } } }
其实真执行回收的以下的方法。
public void shrink(boolean checkTime, boolean keepAlive) {
try {
lock.lockInterruptibly();
} catch (InterruptedException e) {
return;
}
boolean needFill = false;
int evictCount = 0;
int keepAliveCount = 0;
try {
if (!inited) {
return;
}
final int checkCount = poolingCount - minIdle;
final long currentTimeMillis = System.currentTimeMillis();
for (int i = 0; i < poolingCount; ++i) {
DruidConnectionHolder connection = connections[i];
if (checkTime) {
if (phyTimeoutMillis > 0) {
long phyConnectTimeMillis = currentTimeMillis - connection.connectTimeMillis;
if (phyConnectTimeMillis > phyTimeoutMillis) {
evictConnections[evictCount++] = connection;
continue;
}
}
long idleMillis = currentTimeMillis - connection.lastActiveTimeMillis;
if (idleMillis < minEvictableIdleTimeMillis
&& idleMillis < keepAliveBetweenTimeMillis
) {
break;
}
if (idleMillis >= minEvictableIdleTimeMillis) {
if (checkTime && i < checkCount) {
evictConnections[evictCount++] = connection;
continue;
} else if (idleMillis > maxEvictableIdleTimeMillis) {
evictConnections[evictCount++] = connection;
continue;
}
}
if (keepAlive && idleMillis >= keepAliveBetweenTimeMillis) {
keepAliveConnections[keepAliveCount++] = connection;
}
} else {
if (i < checkCount) {
evictConnections[evictCount++] = connection;
} else {
break;
}
}
}
int removeCount = evictCount + keepAliveCount;
if (removeCount > 0) {
System.arraycopy(connections, removeCount, connections, 0, poolingCount - removeCount);
Arrays.fill(connections, poolingCount - removeCount, poolingCount, null);
poolingCount -= removeCount;
}
keepAliveCheckCount += keepAliveCount;
if (keepAlive && poolingCount + activeCount < minIdle) {
needFill = true;
}
} finally {
lock.unlock();
}
if (evictCount > 0) {
for (int i = 0; i < evictCount; ++i) {
DruidConnectionHolder item = evictConnections[i];
Connection connection = item.getConnection();
JdbcUtils.close(connection);
destroyCountUpdater.incrementAndGet(this);
}
Arrays.fill(evictConnections, null);
}
if (keepAliveCount > 0) {
// keep order
for (int i = keepAliveCount - 1; i >= 0; --i) {
DruidConnectionHolder holer = keepAliveConnections[i];
Connection connection = holer.getConnection();
holer.incrementKeepAliveCheckCount();
boolean validate = false;
try {
this.validateConnection(connection);
validate = true;
} catch (Throwable error) {
if (LOG.isDebugEnabled()) {
LOG.debug("keepAliveErr", error);
}
// skip
}
boolean discard = !validate;
if (validate) {
holer.lastKeepTimeMillis = System.currentTimeMillis();
boolean putOk = put(holer, 0L);
if (!putOk) {
discard = true;
}
}
if (discard) {
try {
connection.close();
} catch (Exception e) {
// skip
}
lock.lock();
try {
discardCount++;
if (activeCount + poolingCount <= minIdle) {
emptySignal();
}
} finally {
lock.unlock();
}
}
}
this.getDataSourceStat().addKeepAliveCheckCount(keepAliveCount);
Arrays.fill(keepAliveConnections, null);
}
if (needFill) {
lock.lock();
try {
int fillCount = minIdle - (activeCount + poolingCount + createTaskCount);
for (int i = 0; i < fillCount; ++i) {
emptySignal();
}
} finally {
lock.unlock();
}
}
}
说明:
- 空闲时间还没有超最小生存时间(minEvictableIdleTimeMillis)时,是不会回收的。
- 空闲时间超过最小生存时间是并不会全部回收,只会回收前poolingCount - minIdle,minIdle数量暂时不会回收。
- 当空闲大于最大生存时间时(maxEvictableIdleTimeMillis)时,由客户端口全部回收,druid默认的maxEvictableIdleTimeMillis为7小时。
- 当服务端的超时时间大于配置的wait_timeout时间时会由服务器全部断开超时链接,不管客端的情况。
- keepAlive保持链接的逻辑也在这代码中体现。
3.3、客户端超时验证机制
超时验证机制是指客户端拿到链接时,只要当时时间与链接最后活动时间的差大于检测间隔时间(即currentTimeMillis - lastActiveTimeMillis > timeBetweenEvictionRunsMillis)则会发起链接检测,执行testConnectionInternal检测如代码:
public DruidPooledConnection getConnectionDirect(long maxWaitMillis)
此处省略其它代码
if (testWhileIdle) {
final DruidConnectionHolder holder = poolableConnection.holder;
long currentTimeMillis = System.currentTimeMillis();
long lastActiveTimeMillis = holder.lastActiveTimeMillis;
long lastKeepTimeMillis = holder.lastKeepTimeMillis;
if (lastKeepTimeMillis > lastActiveTimeMillis) {
lastActiveTimeMillis = lastKeepTimeMillis;
}
long idleMillis = currentTimeMillis - lastActiveTimeMillis;
long timeBetweenEvictionRunsMillis = this.timeBetweenEvictionRunsMillis;
if (timeBetweenEvictionRunsMillis <= 0) {
timeBetweenEvictionRunsMillis = DEFAULT_TIME_BETWEEN_EVICTION_RUNS_MILLIS;
}
if (idleMillis >= timeBetweenEvictionRunsMillis
|| idleMillis < 0 // unexcepted branch
) {
boolean validate = testConnectionInternal(poolableConnection.holder, poolableConnection.conn);
if (!validate) {
if (LOG.isDebugEnabled()) {
LOG.debug("skip not validate connection.");
}
discardConnection(realConnection);
continue;
}
}
}
}
说明:在testOnBorrow=false的情况下,这个参数不要轻易配置,否则浪费性能。
Mysql的检测机制有两种,两种机制理论上能达到相同的效果。在执行的时间都能达到续约的效果(即执行后能重置lastActiveTimeMillis为当前执行的时间)
- ping
druid默认是ping机制,所以默认时配置validationQuery参数是无效的。如果要改变则进程的启动参数中(jvm参数)设置-Ddruid.mysql.usePingMethod=false即可。 - 查询SQL(select 1)
使用查询的机制来检测链接的可用性更的保险。
代码如下:
public boolean isValidConnection(Connection conn, String validateQuery, int validationQueryTimeout) throws Exception {
if (conn.isClosed()) {
return false;
}
if (usePingMethod) {
if (conn instanceof DruidPooledConnection) {
conn = ((DruidPooledConnection) conn).getConnection();
}
if (conn instanceof ConnectionProxy) {
conn = ((ConnectionProxy) conn).getRawObject();
}
if (clazz.isAssignableFrom(conn.getClass())) {
if (validationQueryTimeout <= 0) {
validationQueryTimeout = DEFAULT_VALIDATION_QUERY_TIMEOUT;
}
try {
ping.invoke(conn, true, validationQueryTimeout * 1000);
} catch (InvocationTargetException e) {
Throwable cause = e.getCause();
if (cause instanceof SQLException) {
throw (SQLException) cause;
}
throw e;
}
return true;
}
}
String query = validateQuery;
if (validateQuery == null || validateQuery.isEmpty()) {
query = DEFAULT_VALIDATION_QUERY;
}
Statement stmt = null;
ResultSet rs = null;
try {
stmt = conn.createStatement();
if (validationQueryTimeout > 0) {
stmt.setQueryTimeout(validationQueryTimeout);
}
rs = stmt.executeQuery(query);
return true;
} finally {
JdbcUtils.close(rs);
JdbcUtils.close(stmt);
}
}
四、结论
1、理论上是不会出来问题的,因为空闲时间只要大于等于timeBetweenEvictionRunsMillis时间会验测出来,则timeBetweenEvictionRunsMillis=60秒,远小于MYSQL的wait_timeout=600秒。除非mysql-connect-java,在默认ping的机制有不稳定性因素。
2、可能网络抖动在执行验证时就已失败,或非链接断开原因。
五、解决方案
1、配置maxEvictableIdleTimeMillis=500000(500秒)
默认情况下maxEvictableIdleTimeMillis=25200000L (即7小时),因为数据库的超时时间从8小时改为600秒,为减少风险理应该由客户端在空闲时主动关闭链接。而非超时后由mysql服务器端把链接关闭。
这个值的合理再应该根据公式:
maxEvictableIdleTimeMillis+timeBetweenEvictionRunsMillis
由客户端主动释放超时链接。
2、配置keepAlive=true
打开KeepAlive之后的效果:
- 初始化连接池时会填充到minIdle数量。
- 连接池中的minIdle数量以内的连接,空闲时间超过minEvictableIdleTimeMillis,则会执行keepAlive操作。
- 当网络断开等原因产生的由ExceptionSorter检测出来的死连接被清除后,自动补充连接到minIdle数量。
KeepAlive的最大作用是以minIdle数量自动继租,无论服务器端还是客户端都无法释放超时链接(因为不会超时)。
KeepAlive的使用条件是:建议使用druid 1.1.16或者更高版本。
详细文档见官网:KeepAlive配置
3、mysql服务端改回wait_timeout= 28800(8小时)
因为默认情况下客户端最大存活时间maxEvictableIdleTimeMillis为7小时,所以在服务器断开链接前,由客户端主动释放超时链接。(类似解决方案一)。
4、建议升级mysql-connect-java 5.1.48版本。
因为Mysql数据库的与驱动包版本的兼容性可能存在问题,升级新的版本解决了很多存在的BUG,具体的解决哪个问题可以在官方文档。