Analysis of a Database Crash Caused by ORA-04031

Background

On 2014/6/5 a channel partner reported that a customer's database had crashed unexpectedly; after the server was rebooted the database returned to normal. The customer wanted the root cause identified so that the outage would not happen again. Investigating this kind of unexpected crash is a case we handle frequently in remote support. Although a crash can have many causes, the troubleshooting steps are essentially the same and follow a fixed routine. The sections below walk through how the problem was located step by step.

Analysis Steps

Step 1. Check the alert log for error messages

This case is a post-mortem analysis, and for that there is really only one method: read the logs. For an operating system problem you read the OS logs; for a database problem you read the database logs. The database writes many logs, but the most important one is the alert log. In a RAC environment the CRS logs may also be involved, but this is a single-instance environment, so the alert log is all we need. The path and name of the alert log need no introduction; any DBA should know them. With the alert log in hand, the first thing to do is look for anything unusual around the time of the crash. In this case the following errors were recorded shortly before the instance went down:

Errors in file d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select job, nvl2(last_date, ...","sqlarea","tmp")

Wed Jun 04 18:59:17 2014
Errors in file d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select count(*) from sys.job...","sqlarea","tmp")

Wed Jun 04 18:59:22 2014
Errors in file d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select job, nvl2(last_date, ...","sqlarea","tmp")

Wed Jun 04 18:59:22 2014
Errors in file d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select count(*) from sys.job...","sqlarea","tmp")

Wed Jun 04 18:59:27 2014
Errors in file d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select job, nvl2(last_date, ...","sqlarea","tmp")

These entries give an initial fix on the cause of the failure: ORA-00604 raised on top of ORA-04031. To pin it down further we should open the trace file named in the messages, d:\oracle\product\10.2.0\admin\oraxy\bdump\oraxy_cjq0_4340.trc, which will contain far more detail.
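If you are not sure where the alert log and the background trace files live on a given instance, they can be located from the dump destination parameters. A minimal sketch using the standard 10g v$parameter view (the output will of course differ per instance):

-- Show where the alert log and trace files are written
select name, value
  from v$parameter
 where name in ('background_dump_dest', 'user_dump_dest', 'core_dump_dest');

The alert log and background process traces such as oraxy_cjq0_4340.trc are written to background_dump_dest.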

Step 2. Check the trace file for more detail

Reading a trace file takes some experience and a feel for what is relevant; the content and format differ depending on what is being traced. The key section of this trace file is shown below:

SO: 000007FF493D38A0, type: 4, owner: 000007FF49005208, flag: INIT/-/-/0x00
  (session) sid: 543 trans: 0000000000000000, creator: 000007FF49005208, flag: (51) USR/- BSY/-/-/-/-/-
            DID: 0001-0016-00000003, short-term DID: 0000-0000-00000000
            txn branch: 0000000000000000
            oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
  last wait for 'SGA: allocation forcing component growth' blocking sess=0x0000000000000000 seq=30782 wait_time=15629 seconds since wait started=0
          =0, =0, =0
  Dumping Session Wait History
   for 'SGA: allocation forcing component growth' count=1 wait_time=15629
          =0, =0, =0
   for 'SGA: allocation forcing component growth' count=1 wait_time=15006
          =0, =0, =0
   for 'latch: shared pool' count=1 wait_time=624
          address=c96aed8, number=d6, tries=1
   for 'latch: shared pool' count=1 wait_time=1214
          address=c96aed8, number=d6, tries=0
   for 'latch: library cache' count=1 wait_time=77
          address=324ef0f0, number=d7, tries=0
   for 'latch: shared pool' count=1 wait_time=1369765
          address=c96aed8, number=d6, tries=0
   for 'rdbms ipc message' count=1 wait_time=5007402
          timeout=1f4, =0, =0
   for 'rdbms ipc message' count=1 wait_time=5006909
          timeout=1f4, =0, =0
   for 'rdbms ipc message' count=1 wait_time=5007270
          timeout=1f4, =0, =0
   for 'rdbms ipc message' count=1 wait_time=5004478
          timeout=1f4, =0, =0
  temporary object counter: 0
----------------------------------------
UOL used: 0 locks(used=1, free=4)
KGX Atomic Operation Log 000007FF35B23660
 Mutex 0000000000000000(0, 0) idn 0 oper NONE
 Cursor Parent uid 543 efd 10 whr 4 slp 0
 oper=NONE pt1=000007FF2DD5ECA8 pt2=000007FF2DD5ED10 pt3=000007FF2DD5F230
 pt4=0000000000000000 u41=2 stt=0
KGX Atomic Operation Log 000007FF35B236A8
 Mutex 000007FF2A744D18(0, 12) idn 0 oper NONE
 Cursor Stat uid 543 efd 11 whr 2 slp 0
 oper=NONE pt1=000007FF2A744BE8 pt2=0000000000000000 pt3=0000000000000000
 pt4=0000000000000000 u41=0 stt=0
KGX Atomic Operation Log 000007FF35B236F0
 Mutex 0000000000000000(0, 0) idn 0 oper NONE
 Library Cache uid 543 efd 0 whr 0 slp 0

The trace clearly shows the session waiting on 'SGA: allocation forcing component growth'. In other words, an SGA component needed to grow but there was no memory left to give it: the SGA had been exhausted. That also explains why the database came back to normal after the customer rebooted the server. This wait event can be taken as the key to the problem. Searching MetaLink (My Oracle Support) for this wait event turns up the following note:
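Besides the dump in the trace file, the same memory pressure can usually be confirmed from the dynamic performance views while the instance is still up. A minimal sketch using the standard 10g views v$sga_dynamic_components and v$sga_resize_ops (the MB conversion is only for readability):

-- Current and minimum size of each automatically tuned SGA component
select component,
       current_size/1024/1024 as current_mb,
       min_size/1024/1024     as min_mb
  from v$sga_dynamic_components;

-- Recent grow/shrink operations; frequent shrinks of the buffer cache
-- alongside attempted grows of the shared pool point to the same contention
select component, oper_type, oper_mode,
       initial_size/1024/1024 as initial_mb,
       final_size/1024/1024   as final_mb,
       status, end_time
  from v$sga_resize_ops
 order by end_time;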

High Waits for "SGA: allocation forcing component growth" and Slow Database Performance in 10g (Doc ID 1170814.1)
Last updated: 2013-05-16   Type: PROBLEM

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.2.0.1 to 10.2.0.5 [Release 10.2]
Information in this document applies to any platform.
Checked for Relevance 16-MAY-13

SYMPTOMS

Slow database performance in peak hours.

The AWR shows the following in Top 5 waits table.

Event                                    Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
CPU time                                           113                           85.4
db file sequential read                  6,831       7            1               5.4 User I/O 
SGA: allocation forcing component growth   196       3           16               2.3 Other
db file scattered read                   2,723       2            1               1.5 User I/O
kksfbc child completion                     17       1           56                .7 Other 

The top waits sections show "SGA: allocation forcing component growth" during peak time which is not there in normal load.

CAUSE


The waits for "SGA: allocation forcing component growth" indicates that SGA is inadequate.
It looks like there is no enough memory available in SGA for the growth of components.

SOLUTION

You may need to find which SGA component is in huge demand, either shared pool or buffer cache. 
First check the "Cache Sizes" section in AWR to identify which SGA component demands growth and shrink. 

To check for shared pool, look at the shared pool statistics and library cache activity.

Shared Pool Statistics

                            Begin End 
Memory Usage %:             96.73 96.13 <= High usage 
% SQL with executions>1:    85.05 86.02 
% Memory for SQL w/exec>1:  86.32 90.50

...
Library Cache Activity 
"Pct Misses" should be very low

Namespace Get Requests Pct Miss Pin Requests Pct Miss Reloads Invali- 
                                                              dations
BODY             3,923     0.23        7,832    24.85   1,926       0
CLUSTER            544     0.00        3,588     0.70      25       0 
INDEX               20    95.00           22    86.36       0       0 
SQL AREA        86,033    19.06      204,090    25.49  19,348      20

It is clearly evident that the shared pool is in high demand as there are huge reloads happening. 

Check the "Shared Pool Advisory" section and select the optimum value where "Est LC Time Saved Factr" and "Est LC Load Time Factr" correspond to 1. 
Then check buffer cache from "Buffer Pool Advisory" section and select the optimum value for buffer cache where "Est Phys Read Factor" corresponds to 1. 

Increase the SGA_TARGET and SGA_MAX_SIZE to accommodate the SGA component growth and set the minimum value for shared pool and buffer cache based on the advisory so that the Automatic SGA would not shrink the component below the threshold thereby avoiding the contention in the respective component and performance would not be degraded.

To set minimum value for SGA components,

shared_pool_size=<value>M 
db_cache_size=<value>M
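If no AWR report is at hand, the "Shared Pool Advisory" and "Buffer Pool Advisory" figures referred to above can also be read straight from the advisory views. A minimal sketch using the standard v$shared_pool_advice and v$db_cache_advice views (the 8192 block size is an assumption and should match your db_block_size):

-- Pick the shared pool size where both factors are close to 1
select shared_pool_size_for_estimate as pool_mb,
       estd_lc_time_saved_factor,
       estd_lc_load_time_factor
  from v$shared_pool_advice;

-- Pick the buffer cache size where estd_physical_read_factor is close to 1
select size_for_estimate as cache_mb,
       size_factor,
       estd_physical_read_factor
  from v$db_cache_advice
 where name = 'DEFAULT'
   and block_size = 8192;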

Step 3. Resolution

Now that the problem is located, the fix has to start with the SGA. To make sure the same problem does not recur, I approached it from three angles:

1. Adjust the SGA to a more reasonable size.

The total SGA is currently far too small (5 GB) given that the server has plenty of memory (32 GB) and the database is 64-bit; such an SGA setting is clearly unreasonable.

2. Disable automatic SGA management and size the shared pool, buffer cache and other components manually so that they are guaranteed to meet the application's needs. The adjustment statements are as follows:

alter system set sga_max_size=15G scope=spfile;
alter system set sga_target=0 scope=spfile;
alter system set log_buffer=50M scope=spfile;
alter system set db_cache_size=10G scope=spfile;
alter system set java_pool_size=100M scope=spfile;
alter system set large_pool_size=100M scope=spfile;
alter system set shared_pool_size=3G scope=spfile;

3. Back up the original parameter file before making the change so that, if the adjustment causes any problem, you can roll back quickly (see the sketch below).
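For the backup in item 3, one straightforward approach is to dump the current spfile to a plain-text pfile before touching anything; the path below is only an example:

-- Preserve the current settings in a text pfile (example path)
create pfile='d:\oracle\backup\init_oraxy_before_sga_change.ora' from spfile;

-- To roll back, rebuild the spfile from the backup and restart the instance:
-- create spfile from pfile='d:\oracle\backup\init_oraxy_before_sga_change.ora';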
