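On November 29, 2010, at around 15:00, host p570a could no longer be reached via telnet and applications could not open new connections, which seriously affected the business. I arrived at the customer site at 16:00 to carry out emergency handling.
The emergency handling and problem analysis for this database failure are summarized below.
After logging in to p570a through the HMC console, every command failed with an out-of-memory error, as shown below: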
root@p570a:/> errpt|more
ksh: 0403-031 The fork function failed. There is not enough memory available.
ksh: 0403-031 The fork function failed. There is not enough memory available.
root@p570a:/> ps -ef | grep LOCAL=NO|wc -l
ksh: 0403-031 The fork function failed. There is not enough memory available.
root@p570a:/> ls
ksh: 0403-031 The fork function failed. There is not enough memory available.
With the customer's consent, p570a was rebooted from the HMC console. After the reboot, the error log showed the following:
p570a@root#errpt|more
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
A6DF45AA 1129164210 I O RMCdaemon The daemon is started.
EC0BCCD4 1129164110 T H ent1 ETHERNET DOWN
67145A39 1129163910 U S SYSDUMP SYSTEM DUMP
F48137AC 1129163810 U O minidump COMPRESSED MINIMAL DUMP
1104AA28 1129163810 T S SYSPROC SYSTEM RESET INTERRUPT RECEIVED
9DBCFDEE 1129164110 T O errdemon ERROR LOGGING TURNED ON
B6267342 1126235510 P H hdisk3 DISK OPERATION ERROR
B6267342 1125235510 P H hdisk3 DISK OPERATION ERROR
C5C09FFA 1125062110 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1125051010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA 1124144010 P S SYSVMM SOFTWARE PROGRAM ABNORMALLY TERMINATED
p570a@root#errpt -aj C5C09FFA |more
---------------------------------------------------------------------------
LABEL: PGSP_KILL
IDENTIFIER: C5C09FFA
Date/Time: Thu Nov 25 06:21:13 BEIST 2010
Sequence Number: 99122
Machine Id: 00C6E9C54C00
Node Id: p570a
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSVMM
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Probable Causes
SYSTEM RUNNING OUT OF PAGING SPACE
Failure Causes
INSUFFICIENT PAGING SPACE DEFINED FOR THE SYSTEM
PROGRAM USING EXCESSIVE AMOUNT OF PAGING SPACE
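The system had therefore been reporting insufficient paging space since November 24 (the PGSP_KILL entries above), which means physical memory had been used up even earlier.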
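Paging space exhaustion of this kind can usually be caught before the kernel starts killing processes. The following is a minimal sketch, not part of the original handling: it parses the two-line output of the AIX command `lsps -s` and warns above a threshold. The function names, the 70% threshold and the sample figures are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: warn when AIX paging space usage crosses a threshold.

# paging_pct_used reads `lsps -s` style output on stdin and prints the
# "Percent Used" figure as a bare integer.
paging_pct_used() {
    # The second line of `lsps -s` looks like "      8192MB        97%"
    # (sample values, not taken from the incident).
    awk 'NR == 2 { sub(/%/, "", $2); print $2 }'
}

# check_paging THRESHOLD: reads `lsps -s` output on stdin, prints a status
# line and returns 1 when usage is at or above THRESHOLD percent.
check_paging() {
    threshold=$1
    used=$(paging_pct_used)
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: paging space ${used}% used (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: paging space ${used}% used"
}
```

On the AIX host this would be run as `lsps -s | check_paging 70`, for example from cron, so that the warning arrives while fork still works.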
From November 24 onward, alert_gzjb1.log also contains a large number of errors like the following:
Wed Nov 24 22:36:15 2010
ORA-27302: failure occurred at: skgpspawn3
ORA-27301: OS failure message: Not enough space
ORA-27300: OS system dependent operation:fork failed with status: 12
Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_psp0_352314.trc:
Process startup failed, error stack:
Thu Nov 25 02:56:24 2010
Process q000 died, see its trace file
Thu Nov 25 02:56:13 2010
ORA-27302: failure occurred at: skgpspawn3
ORA-27301: OS failure message: Not enough space
ORA-27300: OS system dependent operation:fork failed with status: 12
Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_psp0_352314.trc:
Process startup failed, error stack:
Instance terminated by USER, pid = 144242
USER (ospid: 144242): terminating the instance due to error 443
Process LMHB died, see its trace file
ORA-27302: failure occurred at: skgpspawn3
ORA-27301: OS failure message: Not enough space
ORA-27300: OS system dependent operation:fork failed with status: 12
Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_ora_144242.trc:
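The database instance on node p570a went down because physical memory and paging space were both exhausted, so memory requests (such as the fork of a new process) could no longer be satisfied.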
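To see how early the fork failures started and how they built up, the ORA-27300 entries in the alert log can be counted per day. This is a rough sketch under the assumption that the alert log uses the text format shown above, where a timestamp line such as "Wed Nov 24 22:36:15 2010" precedes the messages logged at that time; the function name is made up for this example.

```shell
#!/bin/sh
# Sketch: count ORA-27300 (fork failed) entries per day in an Oracle alert log.
# Reads the alert log on stdin and prints "Mon DD YYYY: count" per day,
# in the order the days first appear.
fork_failures_per_day() {
    awk '
        BEGIN { day = "unknown" }
        # A timestamp line: 5 fields, weekday first, 4-digit year last.
        NF == 5 && $1 ~ /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)$/ && $5 ~ /^[0-9][0-9][0-9][0-9]$/ {
            day = $2 " " $3 " " $5
            next
        }
        /ORA-27300/ {
            if (!(day in count)) order[++n] = day   # remember first-seen order
            count[day]++
        }
        END {
            for (i = 1; i <= n; i++) print order[i] ": " count[order[i]]
        }
    '
}
```

Usage would be `fork_failures_per_day < alert_gzjb1.log`. A count that rises over several days is a hint that memory pressure was building up well before the outage.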
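TNS-12500: TNS:listener failed to start a dedicated server process
TNS-12540: TNS:internal limit restriction exceeded
TNS-12560: TNS:protocol adapter error
TNS-00510: Internal limit restriction exceeded
IBM/AIX RISC System/6000 Error: 12: Not enough space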
The listener log likewise shows that incoming connection requests could not be served.
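Every connection accepted by the listener becomes a dedicated server process, so once the host responds again it is useful to track how many of them each instance has. The sketch below is an illustration only: it assumes the usual process naming "oracle<SID> (LOCAL=NO)" in `ps -ef` output, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: count Oracle dedicated server processes per instance.
# Reads `ps -ef` output on stdin. Processes spawned by the listener show
# up with a command of the form "oracle<SID> (LOCAL=NO)".
dedicated_servers_per_instance() {
    awk '$NF == "(LOCAL=NO)" { n[$(NF-1)]++ }
         END { for (p in n) print p, n[p] }' | sort
}

# On the host:  ps -ef | dedicated_servers_per_instance
```

Unlike `ps -ef | grep LOCAL=NO | wc -l`, this does not count the grep process itself and it separates the two instances.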
Physical memory
p570a
AIX
System Model: IBM,9117-MMA
Machine Serial Number: 066E9C5
Processor Type: PowerPC_POWER6
Processor Implementation Mode: POWER 6
Processor Version: PV_6_Compat
Number Of Processors: 8
Processor Clock Speed: 3504 MHz
CPU Type: 64-bit
Kernel Type: 64-bit
LPAR Info: 1 06-6E9C5
Memory Size: 15232 MB
Good Memory Size: 15232 MB
Platform Firmware level: EM350_038
Firmware Version: IBM,EM350_038
Console Login: enable
Auto Restart: true
Full Core: false
Total physical memory is therefore about 15 GB.
Database A
SQL> show sga
Total System Global Area 2137886720 bytes
Fixed Size 2208496 bytes
Variable Size 1207962896 bytes
Database Buffers 922746880 bytes
Redo Buffers 4968448 bytes
SQL> show parameter sga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
lock_sga boolean FALSE
pre_page_sga boolean FALSE
sga_max_size big integer 2G
sga_target big integer 2G
SQL> show parameter pga
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 1G
SQL> show parameter instance_name
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
instance_name string gd1
Database A is therefore configured to use about 3 GB of physical memory (2 GB SGA plus a 1 GB PGA target).
Database B
SQL> show sga
Total System Global Area 8551575552 bytes
Fixed Size 2223904 bytes
Variable Size 1778385120 bytes
Database Buffers 6761218048 bytes
Redo Buffers 9748480 bytes
SQL> show parameter sga
NAME TYPE VALUE
lock_sga boolean FALSE
pre_page_sga boolean FALSE
sga_max_size big integer 8G
sga_target big integer 8G
SQL> show parameter instance_name
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
instance_name string gd2
SQL> show parameter pga
NAME TYPE VALUE
pga_aggregate_target big integer 2G
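Database B is therefore configured to use about 10 GB of physical memory (8 GB SGA plus a 2 GB PGA target), a large share of the machine's total.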
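Of the 15 GB of physical memory, 13 GB is allocated to the two databases, which leaves only about 2 GB for the operating system. As the number of application connections grows, or when connections are not released, physical memory and paging space are easily exhausted, which brings the database down and hangs the host.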
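The budget can be checked with a few lines of shell arithmetic. This is only a sketch: the function name is invented here, and it treats `pga_aggregate_target` as if it were a fixed allocation, although it is a target rather than a hard limit, so real usage can be higher.

```shell
#!/bin/sh
# Sketch: physical memory left once the SGA and PGA targets of every
# instance on the host are subtracted.
# Usage: mem_left_mb TOTAL_MB ALLOC_MB [ALLOC_MB ...]
mem_left_mb() {
    left=$1
    shift
    for alloc in "$@"; do
        left=$((left - alloc))
    done
    echo "$left"
}

# Figures from the output above: 15232 MB in total; database A has a 2 GB
# SGA and a 1 GB PGA target; database B has an 8 GB SGA and a 2 GB PGA target.
mem_left_mb 15232 2048 1024 8192 2048    # prints 1920
```

The result, 1920 MB, has to cover the AIX kernel, the file cache and the private memory of every server process, which is why the margin disappears as connections pile up.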
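1) The Oracle memory parameters of the gzcdc database are set too high and should be reduced. After discussion with the application vendor and the customer, the SGA of gzcdc was set to 5 GB and the PGA target to 1 GB. With database A unchanged at 3 GB, this leaves about 6 GB for the operating system (15 GB in total, minus 3 GB for database A and 6 GB for gzcdc).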
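The article does not show the statements used for the change. As a hedged sketch, the helper below prints `ALTER SYSTEM` statements for a given SGA and PGA size. It assumes the instance runs from an spfile and uses `sga_target` rather than `memory_target`. The function name and the instance name `gzcdc1` are placeholders, not values from the incident.

```shell
#!/bin/sh
# Sketch: print the ALTER SYSTEM statements for new SGA/PGA sizes.
# Usage: gen_resize_sql SGA_SIZE PGA_SIZE SID
# The statements are only printed; review them before running them in sqlplus.
gen_resize_sql() {
    sga=$1
    pga=$2
    sid=$3
    cat <<EOF
ALTER SYSTEM SET sga_max_size=$sga SCOPE=SPFILE SID='$sid';
ALTER SYSTEM SET sga_target=$sga SCOPE=SPFILE SID='$sid';
ALTER SYSTEM SET pga_aggregate_target=$pga SCOPE=SPFILE SID='$sid';
EOF
}

# "gzcdc1" is a placeholder instance name.
gen_resize_sql 5G 1G gzcdc1
```

Because `sga_max_size` is a static parameter, the new values only take effect after the instance is restarted.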
Source: ITPUB blog, link: http://blog.itpub.net/7199859/viewspace-680613/. If you repost this article, please credit the source; otherwise legal action may be taken.