在一次版本线上慢查询事件中,发现大量的Communications link failure 异常
我将这种异常分为二类:
1.低于120秒,莫名原因导致 mysql server端强杀
2.超过120秒(公司设定mysql server端120秒强杀)
项目采用的数据层分库分表的技术组件:
当当网的ShardingJdbc
问题场景复现及解决方案
1.低于120秒,莫名原因导致 mysql server端强杀
本地复盘
按照线上的情况,使用了shardingJdbc的shardingApi(当前有128库,就会并行128个线程去执行),查询返回结果集总和大于几万的数据,并且此API是在进程中并发较高,在本机使用了3个线程,然后使用线程池(最高3个,实际需要执行的就是3*128=384个线程),死循环的不断运行,没过多久,在业务日志中就出现大量的Communications link failure,具体异常堆栈如下:
[2018-05-02 02:09:30.014] [WARN] [ShardingJDBC-2] [com.xxx.yyyyy.zzzzzz.pool.ProxyConnection] >>> node_10 - Connection com.mysql.jdbc.JDBC4Connection@46e7894e marked as broken because of SQLSTATE(08S01), ErrorCode(0)
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failureThe last packet successfully received from the server was 34 milliseconds ago. The last packet sent successfully to the server was 6,481 milliseconds ago.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2229)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3410)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3112)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2341)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2736)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2494)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
at com.xxx.yyyyy.zzzzzz.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:36)
at com.xxx.yyyyy.zzzzzz.pool.CaelusProxyPreparedStatement.executeQuery(CaelusProxyPreparedStatement.java)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.executeQueryInternal(PreparedStatementExecutor.java:89)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.access$000(PreparedStatementExecutor.java:39)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor$1.execute(PreparedStatementExecutor.java:73)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor$1.execute(PreparedStatementExecutor.java:68)
at com.dangdang.ddframe.rdb.sharding.executor.ExecutorEngine$1.call(ExecutorEngine.java:114)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 6 bytes, read 2 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2212)
... 22 common frames omitted
在云主机的mysql server端的error日志也对应的出现大量的Got an error reading communication packets和Got timeout writing communication packets 信息(xxx为规避公司安全信息内容)
2018-04-28T02:29:00.358697Z 6217 [Note] Aborted connection 6217 to db: 'xxx_006' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.358720Z 6235 [Note] Aborted connection 6235 to db: 'xxx_001' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.359237Z 6232 [Note] Aborted connection 6232 to db: 'xxx_001' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.359242Z 6215 [Note] Aborted connection 6215 to db: xxx_030' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.359656Z 6251 [Note] Aborted connection 6251 to db: 'xxx_018' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.360229Z 6234 [Note] Aborted connection 6234 to db: 'xxx_005' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.360271Z 6233 [Note] Aborted connection 6233 to db: 'xxx_028' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.360572Z 6241 [Note] Aborted connection 6241 to db: 'xxx_009' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.361019Z 6214 [Note] Aborted connection 6214 to db: 'xxx_025' user: 'xxx' host: '10.100.69.231' (Got an error reading communication packets)
2018-04-28T02:29:00.361023Z 6195 [Note] Aborted connection 6195 to db: 'xxx_031' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.358721Z 6250 [Note] Aborted connection 6250 to db: 'xxx_027' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.343058Z 6193 [Note] Aborted connection 6193 to db: 'xxx_013' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:29:00.361708Z 6225 [Note] Aborted connection 6225 to db: 'xxx_011' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got an error reading communication packets)
2018-04-28T02:32:09.956824Z 6797 [Note] Aborted connection 6797 to db: 'xxx_016' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout writing communication packets)
2018-04-28T02:32:10.071781Z 6800 [Note] Aborted connection 6800 to db: 'xxx_012' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout writing communication packets)
2018-04-28T02:32:10.073718Z 6798 [Note] Aborted connection 6798 to db: 'xxx_007' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout writing communication packets)
初步猜测 此异常是有关联的,根据mysql异常在网上搜索了一下
大致与mysql 配置的以下三个参数有关(此参数线上和云主机的数据库是一致的)
net_read_timeout=3
net_write_timeout=6
net_retry_count=2
附上参数说明:
net_read_timeout :数据读取超时时间。在终止读之前,从一个连接获得数据而等待的时间秒数;当服务正在从客户端读取数据时,net_read_timeout控制何时超时。即客户端执行数据读取,等待多少秒仍未执行成功时自动断开连接。
net_write_timeout:数据库写超时时间。和net_read_timeout意义类似,在终止写之前,等待多少秒把block写到连接;当服务正在写数据到客户端时,net_write_timeout控制何时超时。
同时也看了一篇关于mysql源码文章(可以直接看重点:3、调用vio虚拟I/O接口进行读取阶段)
http://blog.itpub.net/7728585/viewspace-2138631/
那么根据问题,尝试调整mysql server端的参数为
net_read_timeout=10
net_write_timeout=10
net_retry_count=5
再次运行本地复盘的程序,mysql服务端的error信息 有关Got an error reading communication packets和Got timeout writing communication packets 信息已消失,java应用日志中的低于120秒的Communications link failure 也消失了
总结:
修改mysqlserver端虽然可以解决问题,但本质还是在于程序上无止境的并行和大量读写操作,随着程序员无休止的增加并行,mysql server端不可能无脑的调整参数,就算去调了,可能会引发mysql server各种不稳定,那么在程序上就要进行限流,做到每个程序进程中的线程可控,分布式上整体的并行可控,程序上的SQL返回结果集大小控制在合理,控制不了继续拆解.
2.超过120秒(公司设定mysql server端120秒强杀)
此超过120秒的Communications link failure,是在解决完上面第1个问题,还残留的问题,但并不是同一个问题,我们都知道120秒,是公司在mysql server端定的一个标准wait_timeout 时间,过了这个120秒,服务端就无脑的强杀了闲置的连接,这个一般都在连接池的配置上,可以规避.
但我这边碰到的问题,其实已经有配连接池策略,但还是发生了,那么就说明,一定在程序中有连接被获取后,闲置了120秒以上再使用它,连接池caelus是可以排除的(这在无数的实践中都有证明,连接池在连接管理是稳定的),那么剩下的就只有ShardingJdbc和程序上不合理的代码了
但查看线上业务日志,此问题发生在不同的作业,那么问题肯定就在公共的代码中了
本地复盘
直接使用问题1的代码方式,既可重现,在执行程序后2到3分钟后,就开始出现120秒以上的Communications link failure,具体异常堆栈如下:
五月 01, 2018 10:21:37 下午 com.google.common.util.concurrent.AggregateFuture$RunningState handleException
严重: Got more than one input Future failure. Logging failures after the first
com.dangdang.ddframe.rdb.sharding.exception.ShardingJdbcException: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failureThe last packet successfully received from the server was 143,759 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.
at com.dangdang.ddframe.rdb.sharding.executor.ExecutorExceptionHandler.handleException(ExecutorExceptionHandler.java:61)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.executeQueryInternal(PreparedStatementExecutor.java:92)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.access$000(PreparedStatementExecutor.java:39)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor$1.execute(PreparedStatementExecutor.java:73)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor$1.execute(PreparedStatementExecutor.java:68)
at com.dangdang.ddframe.rdb.sharding.executor.ExecutorEngine$1.call(ExecutorEngine.java:114)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failureThe last packet successfully received from the server was 143,759 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.
at sun.reflect.GeneratedConstructorAccessor47.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3559)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3459)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3900)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2494)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
at com.xxx.yyyyy.zzzzzz.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:36)
at com.xxx.yyyyy.zzzzzz.pool.CaelusProxyPreparedStatement.executeQuery(CaelusProxyPreparedStatement.java)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.executeQueryInternal(PreparedStatementExecutor.java:89)
... 10 more
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3469)
... 20 more五月 01, 2018 10:21:37 下午 com.google.common.util.concurrent.AggregateFuture$RunningState handleException
严重: Got more than one input Future failure. Logging failures after the first
com.dangdang.ddframe.rdb.sharding.exception.ShardingJdbcException: java.sql.SQLException: No operations allowed after statement closed.
at com.dangdang.ddframe.rdb.sharding.executor.ExecutorExceptionHandler.handleException(ExecutorExceptionHandler.java:61)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.executeQueryInternal(PreparedStatementExecutor.java:92)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.access$000(PreparedStatementExecutor.java:39)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor$1.execute(PreparedStatementExecutor.java:73)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor$1.execute(PreparedStatementExecutor.java:68)
at com.dangdang.ddframe.rdb.sharding.executor.ExecutorEngine$1.call(ExecutorEngine.java:114)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: No operations allowed after statement closed.
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:897)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:886)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:860)
at com.mysql.jdbc.StatementImpl.checkClosed(StatementImpl.java:442)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1919)
at com.xxx.yyyyy.zzzzzz.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:36)
at com.xxx.yyyyy.zzzzzz.pool.CaelusProxyPreparedStatement.executeQuery(CaelusProxyPreparedStatement.java)
at com.dangdang.ddframe.rdb.sharding.executor.PreparedStatementExecutor.executeQueryInternal(PreparedStatementExecutor.java:89)
... 10 more
在mysql server端的error日志也对应的大量的Got timeout reading communication packets 信息(xxx为规避公司安全信息内容)
2018-04-28T00:21:11.006811Z 6605 [Note] Aborted connection 6605 to db: 'xxx_005' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:24:43.600432Z 6607 [Note] Aborted connection 6607 to db: 'xxx_026' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:29:46.082047Z 6609 [Note] Aborted connection 6609 to db: 'xxx_023' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:33:19.703812Z 6610 [Note] Aborted connection 6610 to db: 'xxx_017' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:36:49.871118Z 6612 [Note] Aborted connection 6612 to db: 'xxx_003' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:40:20.465173Z 6614 [Note] Aborted connection 6614 to db: 'xxx_024' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:45:20.750288Z 6616 [Note] Aborted connection 6616 to db: 'xxx_009' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:50:22.095151Z 6618 [Note] Aborted connection 6618 to db: 'xxx_000' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:53:52.137204Z 6620 [Note] Aborted connection 6620 to db: 'xxx_009' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T00:57:22.688387Z 6621 [Note] Aborted connection 6621 to db: 'xxx_021' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T01:04:29.575128Z 6625 [Note] Aborted connection 6625 to db: 'xxx_026' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T01:09:29.641212Z 6627 [Note] Aborted connection 6627 to db: 'xxx_024' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
2018-04-28T01:12:59.751070Z 6628 [Note] Aborted connection 6628 to db: 'xxx_014' user: 'xxx' host: 'xxx.xxx.xxx.xxx' (Got timeout reading communication packets)
有了这些日志,首先看mysql端的日志,此Got timeout reading communication packets 平时也是持续稳定的输出,并不是因为程序出现问题才出现,可以得出,此是服务端在wait_timeout 配置的时间到了,强杀的对应日志,所以意义不大.
随后查看业务日志的异常堆栈,可以看出ShardingJdbc使用了线程池,然后根据堆栈查看源代码,发现ShardingJdbc在处理多线程归并逻辑时,是先产生获取连接的动作,再多线程去执行连接的Statement ,那么问题就来了,ShardingJdbc默认设定的是100的最大并行数,如果需要处理超过或者本机CPU较慢时,线程池就会排队,线程池在具体线程还没有运行前的connection就已经开始闲置,随时时间的推移,陆陆续续产生超过120秒的Communications link failure
附上查看ShardingJdbc关键源代码,大家可以通过上面的异常堆栈及下面的源代码互相对应
以下二段源代码,顺序也是执行的前后顺序
在ShardingPreparedStatement 167行已经开始产生连接
对应的类:
com\dangdang\sharding-jdbc-core\1.4.2\sharding-jdbc-core-1.4.2.jar!\com\dangdang\ddframe\rdb\sharding\jdbc\ShardingPreparedStatement.class
对应的行:PreparedStatement preparedStatement = (PreparedStatement) getStatement(getShardingConnection().getConnection(each.getDataSource(), sqlRouteResult.getSqlStatementType()), each.getSql());
在ExecutorEngine 110行已经开始多线程运行
com\dangdang\sharding-jdbc-core\1.4.2\sharding-jdbc-core-1.4.2.jar!\com\dangdang\ddframe\rdb\sharding\executor\ExecutorEngine.class
result.add(executorService.submit(new Callable() {
@Override
public O call() throws Exception {
return executeUnit.execute(each);
}
}));
如上可以看到如果按这样的玩法,再结合单进程内多线程的超越本机能力的并行,就会间断性的产生超过120秒的Communications link failure
总结:
按我的理解,这个应该算是ShardingJdbc的BUG
vipshop_ebs/朱杰
2018-12-24