前几天很悲剧,公司的oracle小型机据说是网卡驱动有问题,导致数据库整整挂了1个多小时,后来切换到了备份数据库上,可用性一下子跌倒了3个9。
在后续的某一天,凌晨3点进行了网卡升级,又从备份库重新切回主库的时,测试过程中发现dbcp的自动重连部分机器有问题。
整个过程出现了两种错误日志:
异常1:
Caused by: java.sql.SQLException: The Network Adapter could not establish the connection at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70) at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:131) at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:197) at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:525) at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:413) at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:508) at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:203) at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:33) at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:510) at com.alibaba.china.jdbc.SimpleDriver.connect(SimpleDriver.java:101) at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38) at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582) at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148) at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79) at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106) at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044) at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113) at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190) ... 49 more Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:328) at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:421) at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:630) at oracle.net.ns.NSProtocol.connect(NSProtocol.java:206) at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:966) at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:292) ... 62 more Caused by: java.net.ConnectException: Connection timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333) at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195) at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366) at java.net.Socket.connect(Socket.java:525) at java.net.Socket.connect(Socket.java:475) at java.net.Socket.<init>(Socket.java:372) at java.net.Socket.<init>(Socket.java:186) at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:127) at oracle.net.nt.ConnOption.connect(ConnOption.java:126) at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:306) ... 67 more
异常2:
Caused by: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114) at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044) at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113) at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190) ... 49 more Caused by: java.util.NoSuchElementException: Timeout waiting for idle object at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1134) at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79) at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106) ... 52 more
下面再看一下,我们的dbcp配置: (关于dbcp的配置,可以先看下这两篇文章: dbcp基本配置和重连配置 , 解读dbcp自动重连那些事 )
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"> <property name="driverClassName" value="oracle.jdbc.OracleDriver" /> <property name="url" value="${database.driver.url}" /> <property name="username"><value>${database.driver.username}</value></property> <property name="password" value="${database.driver.password}" /> <property name="maxActive"><value>14</value></property> <property name="initialSize"><value>1</value></property> <property name="maxWait"><value>60000</value></property> <property name="maxIdle"><value>14</value></property> <property name="minIdle"><value>1</value></property> <property name="removeAbandoned"><value>true</value></property> <property name="removeAbandonedTimeout"><value>180</value></property> <property name="timeBetweenEvictionRunsMillis"><value>60000</value></property> <property name="minEvictableIdleTimeMillis"><value>1800000</value></property> <property name="connectionProperties"><value>bigStringTryClob=true;clientEncoding=GBK;defaultRowPrefetch=50;serverEncoding=ISO-8859-1</value></property> </bean>
这是oracle的配置,mysql配置时多了个validate配置。
问题分析
1. 从日志分析来看,对应异常1的只在前面出现了百来次之后,后续的请求都直接抛出了异常2。
2. 从页面使用来看,每次页面返回502错误,一般是等待了60s左右。 这个和maxWait参数配置的60s接近。
3. 当时验证的两组集群,一组业务的机器没问题,全都自动重链ok的。 另一组中,有一台预发布机也能自动重连,但就几台提供正式访问的机器自动重连一直失败.
4. 针对自动重连失败的机器,netstat -nat | grep 1521查看链接为0
通过这几个现象,和查看了dbcp的源码,大致的一个猜想:
1. 假设说原先的应用中active的链接数为maxActive
2. 在数据库进行关闭后,原先的连接池中的异常会拿到"Broken pipe"异常,从日志中分析也就只有几条,符合。
如果正在使用的连接会抛出:java.sql.SQLException: 无法从套接字读取更多的数据。
如果拿到链接正准备使用,会抛出: java.sql.SQLException: OALL8 处于不一致状态
3. 等这些链接,被淘汰完之后,新的请求会尝试调用makeObject()重新去创建链接,因为此时的数据库的服务已经关闭,(理论上应该是"Connection refused",直接是端口不可访问, 但这个估计和我们采用了虚拟ip做数据库热备有关,后面会进行虚拟ip到真实数据库的一个路由过程),所以最后这里抛出了“java.net.ConnectException: Connection timed out”
4. 因为数据库重启过程,机器始终是对外提供服务,而且流量还不小。所以很容易堆积这样正在执行makeObject()创建链接的请求,一旦超过了一个阀值后,后续的链接就会是一个block的过程,block时间到了60s后就直接抛出异常。到时具体看一下dbcp的源码
所以总结一下:
1. 数据库关闭一段时间后,请求阻塞在makeObject()创建链接过程,一直没有释放,导致后续的请求全都是处于block,根本没机会去makeObject()创建链接,即使这时数据库已经重启后恢复了,应用还是无法正常提供服务。
2. 一般的数据库临时性维护,比如重启调整配置,热备数据库切换,网络瞬间断开。这一类问题,基本上在下一次makeObject()时就可以创建好新的链接,不会处于一个长期的阻塞状态,所以自动重连基本没啥影响。
按照这样的解释,因自己能力有限,不能搭建这么一套热备的数据库环境,所以很难再次重现这样的情况。而且基本上数据库维护也就是2这样的情况,基本重链没啥问题。
看一下代码分析:
dbcp使用了common-pools,看一下每次从连接池中获取链接的过程:
public Object borrowObject() throws Exception { long starttime = System.currentTimeMillis(); Latch latch = new Latch(); byte whenExhaustedAction; long maxWait; synchronized (this) { // Get local copy of current config. Can't sync when used later as // it can result in a deadlock. Has the added advantage that config // is consistent for entire method execution whenExhaustedAction = _whenExhaustedAction; maxWait = _maxWait; // Add this request to the queue _allocationQueue.add(latch); // Work the allocation queue, allocating idle instances and // instance creation permits in request arrival order allocate(); // 位置1 } for(;;) { synchronized (this) { assertOpen(); } // If no object was allocated from the pool above if(latch.getPair() == null) { // check if we were allowed to create one if(latch.mayCreate()) { // allow new object to be created } else { // the pool is exhausted switch(whenExhaustedAction) { case WHEN_EXHAUSTED_GROW: 省略..... case WHEN_EXHAUSTED_FAIL: 省略..... case WHEN_EXHAUSTED_BLOCK: // 位置5 try { synchronized (latch) { // Before we wait, make sure another thread didn't allocate us an object // or permit a new object to be created if (latch.getPair() == null && !latch.mayCreate()) { if(maxWait <= 0) { latch.wait(); } else { // this code may be executed again after a notify then continue cycle // so, need to calculate the amount of time to wait final long elapsed = (System.currentTimeMillis() - starttime); final long waitTime = maxWait - elapsed; if (waitTime > 0) { latch.wait(waitTime); } } } else { break; } } } catch(InterruptedException e) { Thread.currentThread().interrupt(); throw e; } if(maxWait > 0 && ((System.currentTimeMillis() - starttime) >= maxWait)) { synchronized(this) { // Make sure allocate hasn't already assigned an object // in a different thread or permitted a new object to be created if (latch.getPair() == null && !latch.mayCreate()) { // Remove latch from the allocation queue _allocationQueue.remove(latch); } else { break; } } throw new NoSuchElementException("Timeout waiting for idle object"); // 位置6 } else { continue; // keep looping } default: throw new IllegalArgumentException("WhenExhaustedAction property " + whenExhaustedAction + " not recognized."); } } } xxxx } private synchronized void allocate() { if (isClosed()) return; // First use any objects in the pool to clear the queue for (;;) { if (!_pool.isEmpty() && !_allocationQueue.isEmpty()) { // 位置2 Latch latch = (Latch) _allocationQueue.removeFirst(); latch.setPair((ObjectTimestampPair) _pool.removeFirst()); _numInternalProcessing++; synchronized (latch) { latch.notify(); } } else { break; } } // Second utilise any spare capacity to create new objects for(;;) { if((!_allocationQueue.isEmpty()) && (_maxActive < 0 || (_numActive + _numInternalProcessing) < _maxActive)) { // 位置3 Latch latch = (Latch) _allocationQueue.removeFirst(); latch.setMayCreate(true); _numInternalProcessing++; // 位置4 synchronized (latch) { latch.notify(); } } else { break; } } }
说明:
位置1: 每个数据库请求都会先调用allocate()方法
位置2: 先检查当前空闲的数据库连接,如果有直接拿一个返回,客户端就直接拿到了链接,因为我们没配置testOnBorrow参数,所以客户端会拿到一个Broken pipe的链接
位置3: 当原先没有空闲的链接,检查一下条件: _numActive + _numInternalProcessing < _maxActive。 注意我们的maxActive设置了14。 注意这里的_numInternalProcessing参数
位置4: 增加_numInternalProcessing计数,表明当前正在处理中的连接数。后续在makeObject成功后,就会执行_numInternalProcessing--操作,numInternalProcessing的意即为当前正在处理的连接数,计算maxActive也将其计算在内
位置5:就是当位置3的条件无法满足时,说白了就是_numInternalProcessing = _maxActive = 14时,后续的请求都会进入WHEN_EXHAUSTED_BLOCK,进行maxWait时间的等待,超过时间也就出现了对应的异常2的内容。
看一下makeObject创建链接过程的一些timeout时间设置:
因为从上面的分析来看,造成无法自动重链的问题主要是makeObject进行了一个阻塞动作。是否可以设置合理的socket参数,一般我们知道的主要是两个参数: connectTimeout , readTimeout。
这一类timeout设置,都应该石由对应的链接driver驱动提供。
mysql : (com.mysql.jdbc.Driver)
方式1: 直接jdbc url中指定参数
jdbc:mysql://127.0.0.1:3066/test?connectTimeout=3000&socketTimeout=60000
方式2: 通过dbcp传递properties参数
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"> .............. <property name="connectionProperties"><value>connectTimeout=3000&socketTimeout=60000</value></property> </bean>
具体参见文档: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html
oracle : (oracle.jdbc.OracleDriver) , 没有找到对应的官方文档,通过反编译源码找到配置项
oracle.net.nt.ConnStrategy中定义:
public void createSocketOptions(Properties paramProperties) { String str1 = null; String str2 = null; int i = 0; Enumeration localEnumeration = paramProperties.keys(); while (localEnumeration.hasMoreElements()) { str1 = (String)localEnumeration.nextElement(); if (str1.equalsIgnoreCase("TCP.NODELAY")) { i = 1; str2 = paramProperties.getProperty("TCP.NODELAY").toUpperCase(); if (str2.equals("NO")) this.socketOptions.put(new Integer(0), "NO"); else this.socketOptions.put(new Integer(0), "YES"); } else if (str1.equalsIgnoreCase("oracle.net.READ_TIMEOUT")) { str2 = paramProperties.getProperty("oracle.net.READ_TIMEOUT"); this.socketOptions.put(new Integer(3), str2); } else if (str1.equalsIgnoreCase("oracle.net.CONNECT_TIMEOUT")) { str2 = paramProperties.getProperty("oracle.net.CONNECT_TIMEOUT"); this.socketOptions.put(new Integer(2), str2); } else if (str1.equalsIgnoreCase("oracle.net.ssl_server_dn_match")) { str2 = paramProperties.getProperty("oracle.net.ssl_server_dn_match"); this.socketOptions.put(new Integer(4), str2); } else if (str1.equalsIgnoreCase("oracle.net.wallet_location")) { str2 = paramProperties.getProperty("oracle.net.wallet_location"); this.socketOptions.put(new Integer(5), str2); } else if (str1.equalsIgnoreCase("oracle.net.ssl_version")) { str2 = paramProperties.getProperty("oracle.net.ssl_version"); this.socketOptions.put(new Integer(6), str2); } else if (str1.equalsIgnoreCase("oracle.net.ssl_cipher_suites")) { str2 = paramProperties.getProperty("oracle.net.ssl_cipher_suites"); this.socketOptions.put(new Integer(7), str2); } else if (str1.equalsIgnoreCase("javax.net.ssl.keyStore")) { str2 = paramProperties.getProperty("javax.net.ssl.keyStore"); this.socketOptions.put(new Integer(8), str2); } else if (str1.equalsIgnoreCase("javax.net.ssl.keyStoreType")) { str2 = paramProperties.getProperty("javax.net.ssl.keyStoreType"); this.socketOptions.put(new Integer(9), str2); } else if (str1.equalsIgnoreCase("javax.net.ssl.keyStorePassword")) { str2 = paramProperties.getProperty("javax.net.ssl.keyStorePassword"); this.socketOptions.put(new Integer(10), str2); } else if (str1.equalsIgnoreCase("javax.net.ssl.trustStore")) { str2 = paramProperties.getProperty("javax.net.ssl.trustStore"); this.socketOptions.put(new Integer(11), str2); } else if (str1.equalsIgnoreCase("javax.net.ssl.trustStoreType")) { str2 = paramProperties.getProperty("javax.net.ssl.trustStoreType"); this.socketOptions.put(new Integer(12), str2); } else if (str1.equalsIgnoreCase("javax.net.ssl.trustStorePassword")) { str2 = paramProperties.getProperty("javax.net.ssl.trustStorePassword"); this.socketOptions.put(new Integer(13), str2); } else { if (!str1.equalsIgnoreCase("ssl.keyManagerFactory.algorithm")) continue; str2 = paramProperties.getProperty("ssl.keyManagerFactory.algorithm"); this.socketOptions.put(new Integer(14), str2); } } if ((i == 0) && (!this.reuseOpt)) this.socketOptions.put(new Integer(0), "YES"); }
指定参数:
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"> .............. <property name="connectionProperties"><value>oracle.net.CONNECT_TIMEOUT=3000&oracle.net.READ_TIMEOUT=60000</value></property> </bean>
因为没法模拟热备的oracle环境,对虚ip机制也不了解。所以针对前面提到的makeObject针对数据库关闭时,不是返回"Connection refused",而是返回给我"Connection timed out"
最后看了下,PlainSocketImpl.c, socket的native的一些实现,苦于知识缺乏,以后有机会再研究一下。
异常列表:
if (connect_rv == JVM_IO_INTR) { JNU_ThrowByName(env, JNU_JAVAIOPKG "InterruptedIOException", "operation interrupted"); } else if (errno == EPROTO) { NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ProtocolException", "Protocol error"); } else if (errno == ECONNREFUSED) { NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException", "Connection refused"); } else if (errno == ETIMEDOUT) { NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException", "Connection timed out"); } else if (errno == EHOSTUNREACH) { NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException", "Host unreachable"); } else if (errno == EADDRNOTAVAIL) { NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException", "Address not available"); } else if ((errno == EISCONN) || (errno == EBADF)) { JNU_ThrowByName(env, JNU_JAVANETPKG "SocketException", "Socket closed"); } else { NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "SocketException", "connect failed"); } return; }
还有线上通过Btrace抓取了下,oracle驱动运行时的timeout参数设置:
@BTrace public class OracleTimeoutTracer { @OnMethod(clazz = "oracle.net.nt.ConnStrategy", method = "createSocketOptions", location = @Location(value = Kind.ENTRY)) public static void strategy(@Self Object self, Properties paramProperties) { println(str(paramProperties)); } @OnMethod(clazz = "oracle.net.nt.TcpNTAdapter", method = "connect", location = @Location(value = Kind.ENTRY)) public static void adapter(@Self Object self) { Field socketTimeoutField = field("oracle.net.nt.TcpNTAdapter", "sockTimeout"); Field hostField = field("oracle.net.nt.TcpNTAdapter", "host"); Field portField = field("oracle.net.nt.TcpNTAdapter", "port"); int scoketTimeout = (Integer) get(socketTimeoutField, self); int port = (Integer) get(portField, self); String host = (String) get(hostField, self); println(strcat(strcat(strcat(strcat(strcat("host : ", host), ", port : "), str(port)), " , socketTimeout : "), str(scoketTimeout))); } }
输出结果:
{user=alibaba, password=xxxxx, bigStringTryClob=true, defaultRowPrefetch=50, protocol=thin} host : 10.20.36.19, port : 1521 , socketTimeout : 0
貌似默认的超时时间为0,从oracle driver的代码来看,如果没有显示设置timeout参数,默认都是socket.connectTimeout = socket.setSoTimeout = 0;
socket connect连接时,对应的超时时间为0,按照文档描述是没有timeout限制。 但针对Ip,Port端口不可访问是立马就会知道是个"Connection refused",用telnet测试即可。
针对虚ip(ip漂移技术),下次有机会研究和验证下。