Notes on a Tuxedo abnormal termination caused by memory overuse, with Oracle TNS-12535

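The customer notified us that at around 19:40 on the evening of March 30 the Tuxedo middleware had terminated abnormally. This production system has very strict real-time requirements. The customer had already resolved the problem by restarting Tuxedo, but with the World Expo about to open, management took the incident very seriously, so I went on site to diagnose the cause.
On site I found that the environment was AIX 5308 in an active/standby HA configuration and that, following a directive from higher up, the database had been switched over from machine A to machine B. The Oracle alert log showed:
Quote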
Tue Mar 30 19:49:22 2010
WARNING: inbound connection timed out (ORA-3136)
Tue Mar 30 19:49:22 2010
WARNING: inbound connection timed out (ORA-3136)
Tue Mar 30 19:49:40 2010
WARNING: inbound connection timed out (ORA-3136)
Tue Mar 30 19:49:43 2010
WARNING: inbound connection timed out (ORA-3136)

The SQL*Net log (sqlnet.log) showed the following; the IP addresses have been removed to protect the customer's privacy:
Quote
Fatal NI connect error 12170.

  VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Production
  Time: 30-MAR-2010 19:49:40
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    TNS-12535: TNS:operation timed out
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST= ip)(PORT=34700))
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST= ip)(PORT=34710))


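The backend Tuxedo log said, in essence, that it could not fork new processes, which led to the abnormal termination. vmstat showed occasional paging. The host has 14 GB of memory, of which the SGA uses 6 GB. I then checked the vmo parameters on machine B:
Quote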
# vmo -a|grep lru_file_repage
       lru_file_repage = 1
# vmo -a|grep maxperm%
              maxperm% = 80
# vmo -a|grep maxclient%
            maxclient% = 80
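
The occasional paging seen in vmstat is easier to spot if the samples with non-zero paging are filtered out. The following is a minimal sketch, assuming the usual AIX vmstat column order in which `pi` and `po` are the 6th and 7th fields; the function name `paging_samples` is made up for illustration and is not part of any tool.

```shell
# Hypothetical helper: read vmstat output on stdin and report the samples
# in which paging-space page-ins (pi) or page-outs (po) were non-zero.
# Assumes the AIX column order: r b avm fre re pi po fr sr cy ...
paging_samples() {
    awk '
        # Data rows start with a number; header rows do not.
        $1 ~ /^[0-9]+$/ && NF >= 7 && ($6 > 0 || $7 > 0) {
            hits++
            print "pi=" $6 " po=" $7
        }
        END { print "samples with paging: " (hits + 0) }
    '
}

# Usage on the AIX host (12 samples, 5 seconds apart):
#   vmstat 5 12 | paging_samples
```

A persistent non-zero count under load would support the suspicion of memory pressure; a single isolated hit would not say much.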

Checking the vmo parameters on machine A, I found that they had already been tuned:
Quote
# vmo -a|grep lru_file_repage
       lru_file_repage = 0
# vmo -a|grep maxperm%
              maxperm% = 20
# vmo -a|grep maxclient%
            maxclient% = 20
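
Comparing two nodes by eye works for three tunables, but saved copies of `vmo -a` from each host can also be compared mechanically. This is a rough sketch, assuming every line has the `name = value` form shown in the output above; `vmo_diff` and the file paths are hypothetical names chosen for the example.

```shell
# Hypothetical helper: print the tunables whose values differ between
# two saved `vmo -a` outputs (first file = reference node).
vmo_diff() {
    # $1, $2: files holding "name = value" lines, one tunable per line
    awk -F'=' '
        # Strip blanks and tabs around the name and the value.
        { gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $2) }
        # First file: remember each value by tunable name.
        NR == FNR { ref[$1] = $2; next }
        # Second file: report names present in both files with different values.
        ($1 in ref) && ref[$1] != $2 { printf "%s: %s -> %s\n", $1, ref[$1], $2 }
    ' "$1" "$2"
}

# Usage (run the first command on each host, then copy the files together):
#   vmo -a > /tmp/vmo_A.txt        # on machine A
#   vmo -a > /tmp/vmo_B.txt        # on machine B
#   vmo_diff /tmp/vmo_A.txt /tmp/vmo_B.txt
```

Fed the two listings above, it would report all three tunables as different, with machine A's value on the left.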


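In fact, according to IBM's official recommendation, it is enough to set lru_file_repage to 0, which makes the page stealer favor file pages and keeps computational memory from being paged out. There is no need to lower maxperm% and maxclient% to 20%; they can stay at 80%. From the information above, the likely explanation is that the host was short of memory, which caused Tuxedo to terminate abnormally. The customer told me that the database had always run stably on machine A, so I made machine B's parameters match machine A's:
Quote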
# vmo -p -o  maxclient%=20
Setting maxclient% to 20 in nextboot file
Setting maxclient% to 20
# vmo -p -o maxperm%=20
Setting maxperm% to 20 in nextboot file
Setting maxperm% to 20
# vmo -p -o lru_file_repage=0
Setting lru_file_repage to 0 in nextboot file
Setting lru_file_repage to 0
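
To get a feel for what the two percentages mean on this 14 GB host, the ceilings can be worked out with integer arithmetic. This is a back-of-the-envelope sketch only: `pct_of_mem_mb` is an illustrative name, and it treats the percentage as a plain share of real memory, which simplifies how AIX actually accounts for pages.

```shell
# Hypothetical helper: a given percentage of real memory, in whole MB.
pct_of_mem_mb() {
    # $1: real memory in MB, $2: percentage (e.g. 80 for maxperm%=80)
    echo $(( $1 * $2 / 100 ))
}

mem_mb=$((14 * 1024))             # 14 GB host, as stated in the text
pct_of_mem_mb "$mem_mb" 80        # file-cache ceiling with maxperm%=80
pct_of_mem_mb "$mem_mb" 20        # file-cache ceiling with maxperm%=20
```

At 80% the file cache may grow to roughly 11.2 GB, more than is left once the 6 GB SGA is accounted for; at 20% the ceiling is roughly 2.8 GB.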


Since the change, the system has run stably to date. The approach recommended by MetaLink (see Doc 119706.1) was not used.
