2014/6/5接渠道反馈,用户数据库意外宕机,后经过重启服务器数据库恢复正常,用户希望能够排查原因,避免再次出现宕机事故,这种意外宕机原因排查是我们远程处理经常遇到的案例,虽然宕机的原因有很多,但是排查的步骤基本一致,都一个固定的套路,接下来就介绍下如何一步步定位问题。
本案例这种情况,是属于事后分析,这种分析的方法只有一个,就是查看日志,如果是操作系统,就要查看操作系统的日志,是数据库自然就要查看数据库的日志。我们知道数据库有很多日志,但是最关键的日志就是alert日志,如果是RAC环境可能会涉及到CRS的日志,这里是单机环境,我们就只需要关注alert日志就可以,alert日志的路径和日志名称就不用太多介绍,作为DBA这点常识还是应该有,拿到alert日志首先就是查看数据库宕机时间段前后有什么异常记录,如下本案例在宕机前有如下错误提示信息:
Errors infile d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604:递归SQL 级别 1出现错误
ORA-04031:无法分配32 字节的共享内存("shared pool","select job, nvl2(last_date, ...","sqlarea","tmp")
Wed Jun04 18:59:17 2014
Errors infile d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604:递归SQL 级别 1出现错误
ORA-04031:无法分配32 字节的共享内存("shared pool","select count(*) from sys.job...","sqlarea","tmp")
Wed Jun04 18:59:22 2014
Errors infile d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604:递归SQL 级别 1出现错误
ORA-04031:无法分配32 字节的共享内存("shared pool","select job, nvl2(last_date, ...","sqlarea","tmp")
Wed Jun04 18:59:22 2014
Errors infile d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604:递归SQL 级别 1出现错误
ORA-04031:无法分配32 字节的共享内存("shared pool","select count(*) from sys.job...","sqlarea","tmp")
Wed Jun04 18:59:27 2014
Errors infile d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604:递归SQL 级别 1出现错误
ORA-04031:无法分配32 字节的共享内存("shared pool","select job, nvl2(last_date, ...","sqlarea","tmp")
通过日志信息,可以初步定位到这次故障的起因是ora-00604和ora-04031引起,为了进一步准确定位,我们应该查看提示中的子跟踪文件,如上面的d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc,这里面肯定包含了更多详细的信息。
跟踪日志这个查看就需要一些经验和对问题的敏感度定位错误了,根据跟踪对对象的不同,其日志内容格式也不同,如下是我们这次跟踪的关键信息:
SO:000007FF493D38A0, type: 4, owner: 000007FF49005208, flag: INIT/-/-/0x00
(session) sid: 543 trans: 0000000000000000,creator: 000007FF49005208, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0016-00000003, short-termDID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql:0000000000000000, psql: 0000000000000000, user: 0/SYS
last wait for 'SGA: allocation forcing component growth' blocking sess=0x0000000000000000 seq=30782wait_time=15629 seconds since wait started=0
=0, =0, =0
Dumping Session Wait History
for 'SGA: allocation forcing component growth'count=1 wait_time=15629
=0, =0, =0
for 'SGA: allocation forcing componentgrowth' count=1 wait_time=15006
=0, =0, =0
for 'latch: shared pool'count=1 wait_time=624
address=c96aed8, number=d6, tries=1
for 'latch: shared pool' count=1wait_time=1214
address=c96aed8, number=d6, tries=0
for 'latch: library cache' count=1wait_time=77
address=324ef0f0, number=d7, tries=0
for 'latch: shared pool' count=1wait_time=1369765
address=c96aed8, number=d6, tries=0
for 'rdbms ipc message' count=1wait_time=5007402
timeout=1f4, =0, =0
for 'rdbms ipc message' count=1wait_time=5006909
timeout=1f4, =0, =0
for 'rdbms ipc message' count=1wait_time=5007270
timeout=1f4, =0, =0
for 'rdbms ipc message' count=1wait_time=5004478
timeout=1f4, =0, =0
temporary object counter: 0
----------------------------------------
UOL used: 0 locks(used=1, free=4)
KGXAtomic Operation Log 000007FF35B23660
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Cursor Parent uid 543 efd 10 whr 4 slp 0
oper=NONE pt1=000007FF2DD5ECA8pt2=000007FF2DD5ED10 pt3=000007FF2DD5F230
pt4=0000000000000000 u41=2 stt=0
KGXAtomic Operation Log 000007FF35B236A8
Mutex 000007FF2A744D18(0, 12) idn 0 oper NONE
Cursor Stat uid 543 efd 11 whr 2 slp 0
oper=NONE pt1=000007FF2A744BE8pt2=0000000000000000 pt3=0000000000000000
pt4=0000000000000000 u41=0 stt=0
KGXAtomic Operation Log 000007FF35B236F0
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 543 efd 0 whr 0 slp 0
从日志中,可以清楚的看到SGA: allocation forcing component growth等待事件,可以看到是由于SGA无法增长导致,也就是SGA被撑爆了,这也好解释为什么用户重启服务器就恢复正常,可以初步判断问题的关键就是这里,查看metalink上的针对这个等待时间的资料,如下:
High Waits for "SGA: allocation forcing component growth" and Slow Database Performance in 10g (文档 ID 1170814.1) | 转到底部 |
修改时间:2013-5-16类型:PROBLEM |
|
既然问题已经定位,我们就需要从SGA入手,为了保障不再出现类似问题,我主要从3个方面入手:
1. 调整SGA到一个更合理的值。
目前用户的SGA总值大小过小(5G),而用户的内存非常大(32G),数据库又是64位,这样的SGA设置明显不合理。
2. 关闭SGA自动管理,手工设定共享池、高速缓存区等内存大小,保证满足应用,调整语句如下
alter system set sga_max_size=15G scope=spfile
alter system set sga_target=0 scope=spfile
alter system log_buffer=50M scope=spfile
alter system set db_cache_size=10G scope=spfile
alter system set java_pool_size=100M scope=spfile
alter system set large_pool_size =100M scope=spfile
alter system set shared_pool_size=3G scope=spfile
3. 调整的时候注意对原始的参数文件进行备份,避免参数调整出现问题,好及时回退。