Oracle 11g新特性direct path read引发的系统停运故障诊断处理2

1、故障现象

(1)一个业务系统输入用户名与密码后无法进入首页,表现为一直在运行等待,运行缓慢

(2)整个系统无法正常使用,接近停运状态

2、  故障解决方法

调整数据库参数alter system setevent='10949 trace name context forever, level 1'来关闭“direct path read”(直接路径读)特性,使SQL语句可以从缓存中查询数据,达到降低I/O读取量,使全表扫描的数据从缓存中读取,加快SQL语句运行速度的目的。

3、  故障原因总结

(1)由于部分SQL语句设计或编写效率低下,以及表缺少适应的索引,导致SQL语句需要全表扫描,在表较小时,ORACLE数据库将数据读取到缓存后,后续虽然是全表扫描,但均是从缓存中读取,所以问题未体现出来

(2)在表的大小不断增大后,根据ORACLE 11g数据库的算法,在表达到db_cache_size(GB)的2%(默认值)以后,认为采用直接路径读(跳过缓存,直接从磁盘文件中全扫描读取)

(3)DX_T_XXVIATE表大小为1GB,在大量反复以direct pathread磁盘重复读取的情况下,消耗大量的I/O资源,将服务器I/O几乎耗尽

(4)在主机I/O耗尽的情况下,系统的读、写,均几乎处于瘫痪状态

(5)在关闭ORACLE 11G数据库的direct path read新特性功能后,读取方式恢复到从缓存中读取,磁盘读降到“0”,系统恢复正常

4、  改进建议

(1) 优化访问DX_T_XXVIATE 相关的SQL语句与设计合适的索引,避免大表全表扫描。

5、  故障原因分析

5.1  7月11日故障时段数据库服务器I/O等待严重

Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第1张图片

5.2  7月11日故障时段磁盘响应非常缓慢

 Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第2张图片

5.3  对比故障当日(7月11日)与上周的I/O磁盘读取量,比上周大十倍

 Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第3张图片

故障前、中、后磁盘读取量对比图:

Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第4张图片

上面高的蓝色线,是故障当日(2016年7月1日,周一)的磁盘Disk Read KB/s指标线

5.4  高度消耗I/O的SQL语句。

SQL_ID为b8m6wy846qgbk的SQL语句,physical reads鹤立鸡群,可见此SQL语句的影响最为严重。

 

5.5  全表扫描单次超过6秒的表与其SQL语句统计。

统计汇总时间:08:00—10:00

 

SQL_ID

target

出现次数

b8m6wy846qgbk

DX_T_XXVIATE

22

6ut0tkd7hvagc

DX_T_XXVIATE

6

1wq3kmd62x4x8

DX_T_XXVIATE

2

b5ma9qfs62bdm

DX_T_XXVIATE

2

f5hujtmsh3pd1

DX_T_XXVIATE

2

65wk2453uk930

DX_T_XXVIATE

2

6yxk40mn2rnzj

DX_T_XXVIATE

2

8h1bzhfm6purf

DX_T_XXVIATE

2

a1r7xr03naax2

DX_T_XXVIATE

2

6yn7ccy6bbw4h

DX_T_XXVIATE

2

6ar757tgxd1sf

DX_T_XXVIATE

2

gu8cjd8c35kys

DX_T_XXVIATE

2

41j3xyugbn6ck

DX_T_XXVIATE

2

98jagc6q2zr4x

DX_T_XXVIATE

2

1a4rwwt36w33d

DX_T_XXVIATE

2

b4bq40ybpt7a8

DX_T_XXVIATE

1

30za43khyxz6t

DX_T_XXVIATE

1

66j38hs9a0yzh

DX_T_XXVIATE

1

25djwmppxq9qt

DX_T_XXVIATE

1

84pdvmjyb9103

DX_T_XXVIATE

1

b69frxfps0k4t

DX_T_XXVIATE

1

 

统计时间:08:00—10:00 单次扫描超过6秒的SQL语句及时长详细清单:

SQL_ID

target

elapsed_seconds

start_time

f5hujtmsh3pd1

DX_T_XXVIATE

24

2016/7/11 10:02

f5hujtmsh3pd1

DX_T_XXVIATE

26

2016/7/11 10:02

66j38hs9a0yzh

DX_T_XXVIATE

28

2016/7/11 9:51

b8m6wy846qgbk

DX_T_XXVIATE

19

2016/7/11 9:48

b4bq40ybpt7a8

DX_T_XXVIATE

30

2016/7/11 9:41

6ut0tkd7hvagc

DX_T_XXVIATE

23

2016/7/11 9:38

b8m6wy846qgbk

DX_T_XXVIATE

27

2016/7/11 9:37

b8m6wy846qgbk

DX_T_XXVIATE

25

2016/7/11 9:35

b8m6wy846qgbk

DX_T_XXVIATE

28

2016/7/11 9:35

b8m6wy846qgbk

DX_T_XXVIATE

37

2016/7/11 9:34

65wk2453uk930

DX_T_XXVIATE

38

2016/7/11 9:34

8h1bzhfm6purf

DX_T_XXVIATE

31

2016/7/11 9:33

65wk2453uk930

DX_T_XXVIATE

34

2016/7/11 9:33

8h1bzhfm6purf

DX_T_XXVIATE

30

2016/7/11 9:33

b8m6wy846qgbk

DX_T_XXVIATE

32

2016/7/11 9:32

b8m6wy846qgbk

DX_T_XXVIATE

33

2016/7/11 9:32

b8m6wy846qgbk

DX_T_XXVIATE

30

2016/7/11 9:32

b8m6wy846qgbk

DX_T_XXVIATE

31

2016/7/11 9:32

b8m6wy846qgbk

DX_T_XXVIATE

31

2016/7/11 9:31

b8m6wy846qgbk

DX_T_XXVIATE

45

2016/7/11 9:30

84pdvmjyb9103

DX_T_XXVIATE

39

2016/7/11 9:29

b8m6wy846qgbk

DX_T_XXVIATE

39

2016/7/11 9:25

6ut0tkd7hvagc

DX_T_XXVIATE

33

2016/7/11 9:24

b8m6wy846qgbk

DX_T_XXVIATE

36

2016/7/11 9:21

6yxk40mn2rnzj

DX_T_XXVIATE

40

2016/7/11 9:21

6yxk40mn2rnzj

DX_T_XXVIATE

31

2016/7/11 9:20

b5ma9qfs62bdm

DX_T_XXVIATE

36

2016/7/11 9:19

b5ma9qfs62bdm

DX_T_XXVIATE

39

2016/7/11 9:19

b8m6wy846qgbk

DX_T_XXVIATE

98

2016/7/11 9:10

6ut0tkd7hvagc

DX_T_XXVIATE

41

2016/7/11 9:09

6ut0tkd7hvagc

DX_T_XXVIATE

41

2016/7/11 9:08

gu8cjd8c35kys

DX_T_XXVIATE

48

2016/7/11 9:06

gu8cjd8c35kys

DX_T_XXVIATE

47

2016/7/11 9:05

30za43khyxz6t

DX_T_XXVIATE

96

2016/7/11 9:00

b69frxfps0k4t

DX_T_XXVIATE

77

2016/7/11 8:59

6ut0tkd7hvagc

DX_T_XXVIATE

26

2016/7/11 8:48

98jagc6q2zr4x

DX_T_XXVIATE

23

2016/7/11 8:46

98jagc6q2zr4x

DX_T_XXVIATE

22

2016/7/11 8:46

6ar757tgxd1sf

DX_T_XXVIATE

28

2016/7/11 8:44

6ar757tgxd1sf

DX_T_XXVIATE

23

2016/7/11 8:44

6ut0tkd7hvagc

DX_T_XXVIATE

22

2016/7/11 8:43

a1r7xr03naax2

DX_T_XXVIATE

19

2016/7/11 8:42

a1r7xr03naax2

DX_T_XXVIATE

23

2016/7/11 8:42

b8m6wy846qgbk

DX_T_XXVIATE

25

2016/7/11 8:40

1wq3kmd62x4x8

DX_T_XXVIATE

23

2016/7/11 8:39

1wq3kmd62x4x8

DX_T_XXVIATE

21

2016/7/11 8:39

b8m6wy846qgbk

DX_T_XXVIATE

45

2016/7/11 8:30

b8m6wy846qgbk

DX_T_XXVIATE

41

2016/7/11 8:30

b8m6wy846qgbk

DX_T_XXVIATE

27

2016/7/11 8:27

41j3xyugbn6ck

DX_T_XXVIATE

29

2016/7/11 8:27

41j3xyugbn6ck

DX_T_XXVIATE

48

2016/7/11 8:26

1a4rwwt36w33d

DX_T_XXVIATE

53

2016/7/11 8:21

1a4rwwt36w33d

DX_T_XXVIATE

54

2016/7/11 8:20

b8m6wy846qgbk

DX_T_XXVIATE

42

2016/7/11 8:19

b8m6wy846qgbk

DX_T_XXVIATE

52

2016/7/11 8:18

b8m6wy846qgbk

DX_T_XXVIATE

54

2016/7/11 8:16

6yn7ccy6bbw4h

DX_T_XXVIATE

118

2016/7/11 8:14

6yn7ccy6bbw4h

DX_T_XXVIATE

81

2016/7/11 8:12

b8m6wy846qgbk

DX_T_XXVIATE

84

2016/7/11 8:10

25djwmppxq9qt

DX_T_XXVIATE

49

2016/7/11 8:09

 

上面数据显示,08:00—10:00统计时间内,所有全表扫描超过6秒的表,全部是DX_T_XXVIATE这一张表,涉及到的SQL语句有60多条,执行次数最多的数SQL_ID为b8m6wy846qgbk的语句。

5.6  全表扫描最严重SQL语句故障前、后、故障解决后磁盘读取数量对比

5.6.1  7月11日以前系统运行正常的情况下SQL_ID为b8m6wy846qgbk的语句执行统计信息

--执行统计信息(buffer get很大,但是disk reads为0,判定数据基本从buffer中读取):

 Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第5张图片

--执行计划(对DT_T_OBVIATE全表扫描,预计时间为5分钟30秒):

Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第6张图片

 

5.6.2  7月11日故障当日SQL_ID为b8m6wy846qgbk的语句执行统计信息

--执行统计信息(buffer get和disk reads都一样的巨大,基本判定每次数据全是从磁盘读取到BUFFER):

 Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第7张图片

--执行计划(对DT_T_OBVIATE全表扫描,预计时间为5分钟30秒,从执行计划的PHV和plan均看出执行计划在系统故障时与正常时,是保持一致的):

Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第8张图片

 

5.6.3  故障解决后(取7月12日数据)SQL_ID为b8m6wy846qgbk的语句执行统计信息

--执行统计信息(故障解决后,PVH值不变,Disk Reads又恢复到了故障前的“0”,说明每次执行数据又是从BUFFER中读取的):

Oracle 11g新特性direct path read引发的系统停运故障诊断处理2_第9张图片

 

5.7 等待事件变化识别数据读取方式变化比较

SQL_ID

日期

direct path read
发生次数

b8m6wy846qgbk

2016/7/12

0

b8m6wy846qgbk

2016/7/11

4141

b8m6wy846qgbk

2016/7/10

663

b8m6wy846qgbk

2016/7/9

0

b8m6wy846qgbk

2016/7/8

0

b8m6wy846qgbk

2016/7/7

0

 

看来,系统实际上在2016年7月10日(周日),SQL语句的数据读取方式就发生了少量的direct path read,系统实际上已经处于间歇式缓慢状态,到了2016年7月11日(周一),问题特别严重,约99%左右的执行是direct path read,导致I/O耗尽,系统瘫痪。


本文作者:黎俊杰(网名:踩点),从事”系统架构、操作系统、存储设备、数据库、中间件、应用程序“六个层面系统性的性能优化工作

欢迎加入 系统性能优化专业群,共同探讨性能优化技术。群号:258187244


你可能感兴趣的:(ORACLE,性能调优)