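One of our database environments kept raising "unable to connect" alerts that cleared again almost immediately. I checked the alert log and found a large number of entries like the following: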
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 12.2.0.1.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
Time: 08-OCT-2018 23:17:48
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS: operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
...
WARNING: inbound connection timed out (ORA-3136)
-----------------------------------------------------------------------------------
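This error is very common. I had always assumed it was caused by a firewall between the application server and the database server closing connections after a timeout.
But this time the alerts had only started in the last few days, and the system had been in production for more than a year. No recent changes had been made on the network, application, or database side, so I suspected something else was going on.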
-------------------------------------------------------------------------------------
I searched MOS and found Doc ID 2469518.1.
The symptoms it describes match my errors, and it gives the following explanation:
JDBC 11g and later connection establishment with databases use a connection mechanism (o5logon) that requires the use of random numbers. These numbers are typically generated by a special device (/dev/random). However, this random number generator relies on entropy in order to generate sufficiently random numbers. This entropy comes from things like mouse pointer movement and keyboard entry. When there is insufficient entropy, the random number generator will not return any numbers. When this happens, the o5logon used by the JDBC library stalls, and has to wait until sufficient entropy is available. This can cause connections to be reset or refused. On Linux boxes, this has been observed to result in the "connection reset" error.
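In short: with JDBC 11g and later, Oracle's logon mechanism (o5logon) uses random numbers produced by the operating system's /dev/random. /dev/random depends on system entropy, which is collected from events such as interrupts. When there is not enough entropy, no random numbers can be read, and the program either hangs or has its connection dropped.
So when entropy runs short and /dev/random cannot return random numbers, Oracle fails to establish the connection.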
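To get a feel for how much entropy a Linux host has at a given moment, the kernel's own estimate can be read from procfs. The following is a minimal sketch, not something taken from the MOS note: the helper names and the threshold of 200 bits are illustrative assumptions, and the path is the standard Linux procfs location.

```python
from pathlib import Path

# Standard Linux procfs file holding the kernel's entropy estimate (in bits).
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_avail(path=ENTROPY_PATH):
    """Return the entropy estimate stored in `path` as an integer."""
    return int(Path(path).read_text().strip())


def is_entropy_low(bits, threshold=200):
    """Return True when the estimate is below `threshold`.

    The default threshold is an arbitrary illustration value (an assumption),
    not a figure recommended by Oracle or by the kernel documentation.
    """
    return bits < threshold
```

Calling `is_entropy_low(read_entropy_avail())` repeatedly during a burst of logons would show whether the pool is actually being drained on that host.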
The solutions given by MOS:
1.The following java options have proven successful in these cases:
-Djava.security.egd=file:/dev/urandom and -Dsecurerandom.source=file:/dev/./random.
TNS timeout errors and ORA-3136 error you are experiencing point to /dev/random issue known to happen on Linux servers and causes Timeout exceptions. These settings are for the use /dev/urandom instead of using /dev/random which has the entropy fault.
2. Also, the server receives a valid client connection request but the client takes a long time to authenticate more than the default 60 seconds set by the parameter SQLNET.INBOUND_CONNECT_TIMEOUT. Please see Document 465043.1 Troubleshooting Guide ORA-3136: WARNING Inbound Connection Timed Out for details. Increasing this parameter provides more time for the authentication to complete.
3. As a workaround, change the security logging to write to file instead of sending it to the database (application change).
---------------------------------------------------------------------------------------
The first option is the most practical.
-------------------------------------
I had never paid attention to /dev/random and /dev/urandom before, so this was new to me.
-------------------------------------------------------------------------------------------------
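2020-07-14 Proven thoroughly wrong
It turned out the developers had recently deployed a new release, and the errors started exactly when it went live. I later checked listener.log and found that since the new release, the database had been receiving four to five thousand connections per minute, which is clearly abnormal. Before that there were only two or three per minute, which is the normal level.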
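The per-minute connection rate can be pulled out of listener.log with a few lines of script. The sketch below is only an illustration under assumptions: it assumes each connection entry is a single line of fields separated by ` * `, begins with a timestamp such as `08-OCT-2018 23:17:48` or an ISO-style `2018-10-08T23:17:48...`, and contains the field `establish`. The function names and the threshold are hypothetical.

```python
import re
from collections import Counter

# Captures the timestamp up to the minute, e.g. "08-OCT-2018 23:17" or
# "2018-10-08T23:17". Both layouts are assumptions about the log format.
MINUTE_RE = re.compile(r"^(\S+[ T]\d{2}:\d{2}):\d{2}")


def connections_per_minute(lines):
    """Count 'establish' entries per minute from listener.log lines."""
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.split(" * ")]
        if "establish" not in fields:
            continue  # skip service_update and other non-connection entries
        match = MINUTE_RE.match(fields[0])
        if match:
            counts[match.group(1)] += 1
    return counts


def busy_minutes(counts, threshold=1000):
    """Return the minutes whose connection count reaches `threshold`.

    The default threshold is an illustrative assumption; pick one that
    fits the normal load of the system being checked.
    """
    return {minute: n for minute, n in counts.items() if n >= threshold}
```

Passing an open file object, for example `connections_per_minute(open(path))` with `path` pointing at the listener log, yields a counter that makes a jump from a handful to thousands of connections per minute easy to spot.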
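The developers found that one application server appeared to have hung. I'm not sure whether that is what made the application connect to the database so frequently, but after it was restarted the connection rate returned to normal.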
-----------------------------------------
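A quick retrospective: the MOS explanation above is probably correct as well. The database had a large number of connections to establish, but /dev/random could not supply random numbers fast enough, so some connection attempts failed and the corresponding errors showed up in the alert log.
But we should not stop there. We need to find out why so many connections were being created in the first place.
Looking at it from another angle: we currently run a little over two hundred databases, and a fair number of them are fairly idle and probably generate little entropy, yet they show no connection errors. To some extent this suggests that the default system configuration produces enough random numbers in theory, unless something extreme happens, such as the flood of connection requests I ran into.
So for many problems, it is best to start from the specific case and make sure you understand exactly what has changed.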