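The customer notified us that at around 19:40 on the evening of March 30 the Tuxedo middleware had terminated abnormally. The production system has very strict real-time requirements, and although the customer had resolved the problem by restarting Tuxedo, the World Expo was about to open and management took the incident very seriously, so I went on site to diagnose the cause.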
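On site I found that the customer's environment was AIX 5308 with an HA active/standby pair and that, on instructions from above, the database had been switched from node A to run on node B. The Oracle alert log showed:
Quote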
Tue Mar 30 19:49:22 2010
WARNING: inbound connection timed out (ORA-3136)
Tue Mar 30 19:49:22 2010
WARNING: inbound connection timed out (ORA-3136)
Tue Mar 30 19:49:40 2010
WARNING: inbound connection timed out (ORA-3136)
Tue Mar 30 19:49:43 2010
WARNING: inbound connection timed out (ORA-3136)
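To see whether the ORA-3136 warnings cluster around the time of the crash, the alert log can be summarized per minute. The following is only a minimal sketch: the function name `count_ora3136_per_minute` is made up for illustration, and it assumes the 10g alert log layout shown above, where a timestamp line precedes each message line.

```shell
# Count ORA-3136 warnings per minute in an Oracle alert log.
# Assumes a timestamp line such as "Tue Mar 30 19:49:22 2010"
# immediately precedes each message line (layout as in the excerpt above).
# Reads the files given as arguments, or stdin when none are given.
count_ora3136_per_minute() {
    awk '
        # Timestamp line: five fields, fourth looks like HH:MM:SS.
        NF == 5 && $4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            stamp = $2 " " $3 " " substr($4, 1, 5)
            next
        }
        # Attribute each warning to the most recent timestamp seen.
        /ORA-3136/ { hits[stamp]++ }
        END { for (s in hits) print s, hits[s] }
    ' "$@" | sort
}
```

Fed the four warnings quoted above, it prints one line for the minute 19:49 with a count of 4. The path to the alert log differs per installation, so it is passed in as an argument rather than hard-coded.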
The sqlnet.log showed the following (the IP has been removed to protect the customer's privacy):
Quote
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Production
TCP/IP NT Protocol Adapter for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Production
Oracle Bequeath NT Protocol Adapter for IBM/AIX RISC System/6000: Version 10.2.0.4.0 - Production
Time: 30-MAR-2010 19:49:40
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=ip)(PORT=34700))
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=ip)(PORT=34710))
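The Tuxedo log on the backend said, in essence, that it could not fork processes, which led to the abnormal termination. vmstat showed occasional paging. The host has 14G of memory, of which the SGA uses 6G. I then checked the vmo parameters:
Quote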
# vmo -a|grep lru_file_repage
lru_file_repage = 1
# vmo -a|grep maxperm%
maxperm% = 80
# vmo -a|grep maxclient%
maxclient% = 80
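To put maxperm% = 80 in perspective on a 14G host, the percentage can be turned into a rough size. This is a back-of-the-envelope sketch only: the helper name `perm_ceiling_mb` is hypothetical, and it treats maxperm% as a plain share of total RAM, ignoring pinned memory and other details of how AIX actually accounts for it.

```shell
# Rough ceiling, in MB, for file-cache pages given total RAM and maxperm%.
# Simplification (assumption): maxperm% is applied to total RAM as-is.
# Integer arithmetic, so the result is truncated to whole MB.
perm_ceiling_mb() {
    total_mb=$1   # total physical memory in MB
    pct=$2        # maxperm% value as an integer
    echo $(( total_mb * pct / 100 ))
}
```

For example, `perm_ceiling_mb 14336 80` prints 11468, i.e. roughly 11.2G. On paper that is more than the 8G left over once the 6G SGA is accounted for, which is consistent with the occasional paging seen in vmstat.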
Checking the vmo parameters on node A showed that they had already been tuned:
Quote
# vmo -a|grep lru_file_repage
lru_file_repage = 0
# vmo -a|grep maxperm%
maxperm% = 20
# vmo -a|grep maxclient%
maxclient% = 20
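Rather than grepping tunables one at a time on each node, the two configurations can be compared in one pass. The sketch below assumes the output of `vmo -a` has been saved to a file on each node; the function name `vmo_diff` and the file names in the usage note are hypothetical.

```shell
# Print tunables whose values differ between two saved "vmo -a" listings.
# Each relevant line is expected to look like "maxperm% = 80".
# Usage: vmo_diff FIRST_LISTING SECOND_LISTING
vmo_diff() {
    awk '
        $2 != "=" { next }                      # skip anything that is not "name = value"
        NR == FNR { first[$1] = $3; next }      # first file: remember every value
        ($1 in first) && first[$1] != $3 {      # second file: report mismatches
            printf "%s: %s -> %s\n", $1, first[$1], $3
        }
    ' "$1" "$2"
}
```

With node B's listing saved as `vmo_b.txt` and node A's as `vmo_a.txt`, `vmo_diff vmo_b.txt vmo_a.txt` would report the three tunables shown above, each as `name: B value -> A value`. Tunables that exist in only one listing are not reported.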
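In fact, according to IBM's official recommendation, it is enough to set lru_file_repage to 0, which keeps computational memory from being paged out; there is no need to lower maxperm% and maxclient% to 20%, and both can stay at 80%. From the information above, it can be roughly inferred that the host was short of resources, which caused Tuxedo to terminate abnormally. The customer confirmed that the database had always run stably on node A, so I made node B's parameters match node A's:
Quote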
# vmo -p -o maxclient%=20
Setting maxclient% to 20 in nextboot file
Setting maxclient% to 20
# vmo -p -o maxperm%=20
Setting maxperm% to 20 in nextboot file
Setting maxperm% to 20
# vmo -p -o lru_file_repage=0
Setting lru_file_repage to 0 in nextboot file
Setting lru_file_repage to 0
Since the change, the system has run stably to date. The method recommended by Metalink (see doc 119706.1) was not used.