前几天很悲剧,公司的oracle小型机据说是网卡驱动有问题,导致数据库整整挂了1个多小时,后来切换到了备份数据库上,可用性一下子跌倒了3个9。
在后续的某一天,凌晨3点进行了网卡升级,又从备份库重新切回主库的时,测试过程中发现dbcp的自动重连部分机器有问题。
整个过程出现了两种错误日志:
异常1:
Caused by: java.sql.SQLException: The Network Adapter could not establish the connection
at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:131)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:197)
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:525)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:413)
at oracle.jdbc.driver.PhysicalConnection.(PhysicalConnection.java:508)
at oracle.jdbc.driver.T4CConnection.(T4CConnection.java:203)
at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:33)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:510)
at com.alibaba.china.jdbc.SimpleDriver.connect(SimpleDriver.java:101)
at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148)
at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79)
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)
at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190)
... 49 more
Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:328)
at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:421)
at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:630)
at oracle.net.ns.NSProtocol.connect(NSProtocol.java:206)
at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:966)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:292)
... 62 more
Caused by: java.net.ConnectException: Connection timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:525)
at java.net.Socket.connect(Socket.java:475)
at java.net.Socket.(Socket.java:372)
at java.net.Socket.(Socket.java:186)
at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:127)
at oracle.net.nt.ConnOption.connect(ConnOption.java:126)
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:306)
... 67 more
异常2:
Caused by: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)
at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190)
... 49 more
Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1134)
at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79)
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
... 52 more
下面再看一下,我们的dbcp配置: (关于dbcp的配置,可以先看下这两篇文章: dbcp基本配置和重连配置 , 解读dbcp自动重连那些事 )
${database.driver.username}
14
1
60000
14
1
true
180
60000
1800000
bigStringTryClob=true;clientEncoding=GBK;defaultRowPrefetch=50;serverEncoding=ISO-8859-1
这是oracle的配置,mysql配置时多了个validate配置。
问题分析
1. 从日志分析来看,对应异常1的只在前面出现了百来次之后,后续的请求都直接抛出了异常2。
2. 从页面使用来看,每次页面返回502错误,一般是等待了60s左右。 这个和maxWait参数配置的60s接近。
3. 当时验证的两组集群,一组业务的机器没问题,全都自动重链ok的。 另一组中,有一台预发布机也能自动重连,但就几台提供正式访问的机器自动重连一直失败.
4. 针对自动重连失败的机器,netstat -nat | grep 1521查看链接为0
通过这几个现象,和查看了dbcp的源码,大致的一个猜想:
1. 假设说原先的应用中active的链接数为maxActive
2. 在数据库进行关闭后,原先的连接池中的异常会拿到"Broken pipe"异常,从日志中分析也就只有几条,符合。
如果正在使用的连接会抛出:java.sql.SQLException: 无法从套接字读取更多的数据。
如果拿到链接正准备使用,会抛出: java.sql.SQLException: OALL8 处于不一致状态
3. 等这些链接,被淘汰完之后,新的请求会尝试调用makeObject()重新去创建链接,因为此时的数据库的服务已经关闭,(理论上应该是"Connection refused",直接是端口不可访问, 但这个估计和我们采用了虚拟ip做数据库热备有关,后面会进行虚拟ip到真实数据库的一个路由过程),所以最后这里抛出了“java.net.ConnectException: Connection timed out”
4. 因为数据库重启过程,机器始终是对外提供服务,而且流量还不小。所以很容易堆积这样正在执行makeObject()创建链接的请求,一旦超过了一个阀值后,后续的链接就会是一个block的过程,block时间到了60s后就直接抛出异常。到时具体看一下dbcp的源码
所以总结一下:
1. 数据库关闭一段时间后,请求阻塞在makeObject()创建链接过程,一直没有释放,导致后续的请求全都是处于block,根本没机会去makeObject()创建链接,即使这时数据库已经重启后恢复了,应用还是无法正常提供服务。
2. 一般的数据库临时性维护,比如重启调整配置,热备数据库切换,网络瞬间断开。这一类问题,基本上在下一次makeObject()时就可以创建好新的链接,不会处于一个长期的阻塞状态,所以自动重连基本没啥影响。
按照这样的解释,因自己能力有限,不能搭建这么一套热备的数据库环境,所以很难再次重现这样的情况。而且基本上数据库维护也就是2这样的情况,基本重链没啥问题。
看一下代码分析:
dbcp使用了common-pools,看一下每次从连接池中获取链接的过程:
public Object borrowObject() throws Exception {
long starttime = System.currentTimeMillis();
Latch latch = new Latch();
byte whenExhaustedAction;
long maxWait;
synchronized (this) {
// Get local copy of current config. Can't sync when used later as
// it can result in a deadlock. Has the added advantage that config
// is consistent for entire method execution
whenExhaustedAction = _whenExhaustedAction;
maxWait = _maxWait;
// Add this request to the queue
_allocationQueue.add(latch);
// Work the allocation queue, allocating idle instances and
// instance creation permits in request arrival order
allocate(); // 位置1
}
for(;;) {
synchronized (this) {
assertOpen();
}
// If no object was allocated from the pool above
if(latch.getPair() == null) {
// check if we were allowed to create one
if(latch.mayCreate()) {
// allow new object to be created
} else {
// the pool is exhausted
switch(whenExhaustedAction) {
case WHEN_EXHAUSTED_GROW:
省略.....
case WHEN_EXHAUSTED_FAIL:
省略.....
case WHEN_EXHAUSTED_BLOCK: // 位置5
try {
synchronized (latch) {
// Before we wait, make sure another thread didn't allocate us an object
// or permit a new object to be created
if (latch.getPair() == null && !latch.mayCreate()) {
if(maxWait <= 0) {
latch.wait();
} else {
// this code may be executed again after a notify then continue cycle
// so, need to calculate the amount of time to wait
final long elapsed = (System.currentTimeMillis() - starttime);
final long waitTime = maxWait - elapsed;
if (waitTime > 0)
{
latch.wait(waitTime);
}
}
} else {
break;
}
}
} catch(InterruptedException e) {
Thread.currentThread().interrupt();
throw e;
}
if(maxWait > 0 && ((System.currentTimeMillis() - starttime) >= maxWait)) {
synchronized(this) {
// Make sure allocate hasn't already assigned an object
// in a different thread or permitted a new object to be created
if (latch.getPair() == null && !latch.mayCreate()) {
// Remove latch from the allocation queue
_allocationQueue.remove(latch);
} else {
break;
}
}
throw new NoSuchElementException("Timeout waiting for idle object"); // 位置6
} else {
continue; // keep looping
}
default:
throw new IllegalArgumentException("WhenExhaustedAction property " + whenExhaustedAction +
" not recognized.");
}
}
}
xxxx
}
private synchronized void allocate() {
if (isClosed()) return;
// First use any objects in the pool to clear the queue
for (;;) {
if (!_pool.isEmpty() && !_allocationQueue.isEmpty()) { // 位置2
Latch latch = (Latch) _allocationQueue.removeFirst();
latch.setPair((ObjectTimestampPair) _pool.removeFirst());
_numInternalProcessing++;
synchronized (latch) {
latch.notify();
}
} else {
break;
}
}
// Second utilise any spare capacity to create new objects
for(;;) {
if((!_allocationQueue.isEmpty()) && (_maxActive < 0 || (_numActive + _numInternalProcessing) < _maxActive)) { // 位置3
Latch latch = (Latch) _allocationQueue.removeFirst();
latch.setMayCreate(true);
_numInternalProcessing++; // 位置4
synchronized (latch) {
latch.notify();
}
} else {
break;
}
}
}
说明:
位置1: 每个数据库请求都会先调用allocate()方法
位置2: 先检查当前空闲的数据库连接,如果有直接拿一个返回,客户端就直接拿到了链接,因为我们没配置testOnBorrow参数,所以客户端会拿到一个Broken pipe的链接
位置3: 当原先没有空闲的链接,检查一下条件: _numActive + _numInternalProcessing < _maxActive。 注意我们的maxActive设置了14。 注意这里的_numInternalProcessing参数
位置4: 增加_numInternalProcessing计数,表明当前正在处理中的连接数。后续在makeObject成功后,就会执行_numInternalProcessing--操作,numInternalProcessing的意即为当前正在处理的连接数,计算maxActive也将其计算在内
位置5:就是当位置3的条件无法满足时,说白了就是_numInternalProcessing = _maxActive = 14时,后续的请求都会进入WHEN_EXHAUSTED_BLOCK,进行maxWait时间的等待,超过时间也就出现了对应的异常2的内容。
看一下makeObject创建链接过程的一些timeout时间设置:
因为从上面的分析来看,造成无法自动重链的问题主要是makeObject进行了一个阻塞动作。是否可以设置合理的socket参数,一般我们知道的主要是两个参数: connectTimeout , readTimeout。
这一类timeout设置,都应该石由对应的链接driver驱动提供。
mysql : (com.mysql.jdbc.Driver)
方式1: 直接jdbc url中指定参数
jdbc:mysql://127.0.0.1:3066/test?connectTimeout=3000&socketTimeout=60000
方式2: 通过dbcp传递properties参数
..............
connectTimeout=3000&socketTimeout=60000
具体参见文档: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html
oracle : (oracle.jdbc.OracleDriver) , 没有找到对应的官方文档,通过反编译源码找到配置项
oracle.net.nt.ConnStrategy中定义:
public void createSocketOptions(Properties paramProperties)
{
String str1 = null;
String str2 = null;
int i = 0;
Enumeration localEnumeration = paramProperties.keys();
while (localEnumeration.hasMoreElements())
{
str1 = (String)localEnumeration.nextElement();
if (str1.equalsIgnoreCase("TCP.NODELAY"))
{
i = 1;
str2 = paramProperties.getProperty("TCP.NODELAY").toUpperCase();
if (str2.equals("NO"))
this.socketOptions.put(new Integer(0), "NO");
else
this.socketOptions.put(new Integer(0), "YES");
}
else if (str1.equalsIgnoreCase("oracle.net.READ_TIMEOUT"))
{
str2 = paramProperties.getProperty("oracle.net.READ_TIMEOUT");
this.socketOptions.put(new Integer(3), str2);
}
else if (str1.equalsIgnoreCase("oracle.net.CONNECT_TIMEOUT"))
{
str2 = paramProperties.getProperty("oracle.net.CONNECT_TIMEOUT");
this.socketOptions.put(new Integer(2), str2);
}
else if (str1.equalsIgnoreCase("oracle.net.ssl_server_dn_match"))
{
str2 = paramProperties.getProperty("oracle.net.ssl_server_dn_match");
this.socketOptions.put(new Integer(4), str2);
}
else if (str1.equalsIgnoreCase("oracle.net.wallet_location"))
{
str2 = paramProperties.getProperty("oracle.net.wallet_location");
this.socketOptions.put(new Integer(5), str2);
}
else if (str1.equalsIgnoreCase("oracle.net.ssl_version"))
{
str2 = paramProperties.getProperty("oracle.net.ssl_version");
this.socketOptions.put(new Integer(6), str2);
}
else if (str1.equalsIgnoreCase("oracle.net.ssl_cipher_suites"))
{
str2 = paramProperties.getProperty("oracle.net.ssl_cipher_suites");
this.socketOptions.put(new Integer(7), str2);
}
else if (str1.equalsIgnoreCase("javax.net.ssl.keyStore"))
{
str2 = paramProperties.getProperty("javax.net.ssl.keyStore");
this.socketOptions.put(new Integer(8), str2);
}
else if (str1.equalsIgnoreCase("javax.net.ssl.keyStoreType"))
{
str2 = paramProperties.getProperty("javax.net.ssl.keyStoreType");
this.socketOptions.put(new Integer(9), str2);
}
else if (str1.equalsIgnoreCase("javax.net.ssl.keyStorePassword"))
{
str2 = paramProperties.getProperty("javax.net.ssl.keyStorePassword");
this.socketOptions.put(new Integer(10), str2);
}
else if (str1.equalsIgnoreCase("javax.net.ssl.trustStore"))
{
str2 = paramProperties.getProperty("javax.net.ssl.trustStore");
this.socketOptions.put(new Integer(11), str2);
}
else if (str1.equalsIgnoreCase("javax.net.ssl.trustStoreType"))
{
str2 = paramProperties.getProperty("javax.net.ssl.trustStoreType");
this.socketOptions.put(new Integer(12), str2);
}
else if (str1.equalsIgnoreCase("javax.net.ssl.trustStorePassword"))
{
str2 = paramProperties.getProperty("javax.net.ssl.trustStorePassword");
this.socketOptions.put(new Integer(13), str2);
}
else
{
if (!str1.equalsIgnoreCase("ssl.keyManagerFactory.algorithm"))
continue;
str2 = paramProperties.getProperty("ssl.keyManagerFactory.algorithm");
this.socketOptions.put(new Integer(14), str2);
}
}
if ((i == 0) && (!this.reuseOpt))
this.socketOptions.put(new Integer(0), "YES");
}
指定参数:
..............
oracle.net.CONNECT_TIMEOUT=3000&oracle.net.READ_TIMEOUT=60000
因为没法模拟热备的oracle环境,对虚ip机制也不了解。所以针对前面提到的makeObject针对数据库关闭时,不是返回"Connection refused",而是返回给我"Connection timed out"
最后看了下,PlainSocketImpl.c, socket的native的一些实现,苦于知识缺乏,以后有机会再研究一下。
异常列表:
if (connect_rv == JVM_IO_INTR) {
JNU_ThrowByName(env, JNU_JAVAIOPKG "InterruptedIOException",
"operation interrupted");
} else if (errno == EPROTO) {
NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ProtocolException",
"Protocol error");
} else if (errno == ECONNREFUSED) {
NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException",
"Connection refused");
} else if (errno == ETIMEDOUT) {
NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException",
"Connection timed out");
} else if (errno == EHOSTUNREACH) {
NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException",
"Host unreachable");
} else if (errno == EADDRNOTAVAIL) {
NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException",
"Address not available");
} else if ((errno == EISCONN) || (errno == EBADF)) {
JNU_ThrowByName(env, JNU_JAVANETPKG "SocketException",
"Socket closed");
} else {
NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "SocketException", "connect failed");
}
return;
}
还有线上通过Btrace抓取了下,oracle驱动运行时的timeout参数设置:
@BTrace
public class OracleTimeoutTracer {
@OnMethod(clazz = "oracle.net.nt.ConnStrategy", method = "createSocketOptions", location = @Location(value = Kind.ENTRY))
public static void strategy(@Self Object self, Properties paramProperties) {
println(str(paramProperties));
}
@OnMethod(clazz = "oracle.net.nt.TcpNTAdapter", method = "connect", location = @Location(value = Kind.ENTRY))
public static void adapter(@Self Object self) {
Field socketTimeoutField = field("oracle.net.nt.TcpNTAdapter", "sockTimeout");
Field hostField = field("oracle.net.nt.TcpNTAdapter", "host");
Field portField = field("oracle.net.nt.TcpNTAdapter", "port");
int scoketTimeout = (Integer) get(socketTimeoutField, self);
int port = (Integer) get(portField, self);
String host = (String) get(hostField, self);
println(strcat(strcat(strcat(strcat(strcat("host : ", host), ", port : "), str(port)), " , socketTimeout : "),
str(scoketTimeout)));
}
}
输出结果:
{user=alibaba, password=xxxxx, bigStringTryClob=true, defaultRowPrefetch=50, protocol=thin}
host : 10.20.36.19, port : 1521 , socketTimeout : 0
貌似默认的超时时间为0,从oracle driver的代码来看,如果没有显示设置timeout参数,默认都是socket.connectTimeout = socket.setSoTimeout = 0;
socket connect连接时,对应的超时时间为0,按照文档描述是没有timeout限制。 但针对Ip,Port端口不可访问是立马就会知道是个"Connection refused",用telnet测试即可。
针对虚ip(ip漂移技术),下次有机会研究和验证下。