Starting from AWR: fixing a database that kept going down

For privacy reasons, some parts have been replaced with xxx.


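The xxx province database kept going down. Judging from the system logs, it looked as if ASM had lost its permission to access the disks. That is true as far as it goes: after the crash and the automatic restart the permissions were indeed gone, and we had run into the same thing when the database was first built. But why did it always go down around 6 to 7 o'clock? To find out, I compared the AWR report from just before a crash with the AWR report taken after manual recovery.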

Before the crash:

             Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:  7033     16-Aug-12 16:00:43  175       1.3
End Snap:    7034     16-Aug-12 17:00:20  180       1.2
Elapsed:     59.63 (mins)
DB Time:     1,425.64 (mins)

             Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:  7034     16-Aug-12 17:00:20  180       1.2
End Snap:    7035     16-Aug-12 18:00:12  261       1.3
Elapsed:     59.86 (mins)
DB Time:     2,518.78 (mins)

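DB Time does not include time consumed by Oracle background processes. If DB Time is far smaller than the elapsed time, the database is fairly idle. On our Xinjiang database, however, DB Time rose from 1,425.64 minutes in one hourly snapshot to 2,518.78 minutes in the next, against an elapsed time of only 59.86 minutes.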

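Over 59.86 minutes of elapsed time the database accumulated 2,518.78 minutes of DB Time, i.e. about 42 minutes of DB Time per minute of wall-clock time (roughly 42 sessions active on average). The RDA data shows the system has 48 CPUs (24 cores), so that is about 88% of the CPU count. DB Time includes wait time as well as CPU time, so this is not pure CPU utilization, but it does show that the system was under very heavy pressure. Sessions rose from 180 to 261. On top of that, the province-to-ministry interface was collecting data at that moment, and one of the large tables involved receives at least 60 to 80 million rows per hour.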
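As a rough sketch of that arithmetic, the snippet below divides DB Time by elapsed time for the two snapshots quoted above and compares the result with the CPU count. The function name and the dictionary layout are my own, and the ratio is only a load indicator, since DB Time also contains wait time.

```python
def average_active_sessions(db_time_min: float, elapsed_min: float) -> float:
    """DB Time divided by wall-clock time gives the average number of active sessions."""
    return db_time_min / elapsed_min


CPU_COUNT = 48  # CPU count quoted from the RDA data

# (DB Time in minutes, elapsed minutes) copied from the two AWR headers
snapshots = {
    "7033-7034": (1425.64, 59.63),
    "7034-7035": (2518.78, 59.86),
}

for snap, (db_time, elapsed) in snapshots.items():
    aas = average_active_sessions(db_time, elapsed)
    # The ratio to the CPU count is a load indicator, not CPU utilization.
    print(f"snap {snap}: AAS={aas:.1f}, AAS/CPUs={aas / CPU_COUNT:.0%}")
```

The second snapshot's value should agree with the "DB Time(s) per second" figure in the Load Profile section of the same report.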

Load Profile       Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):        42.1        33.2             3.44      6.57
DB CPU(s):         1.1         0.9              0.09      0.18
Redo size:         108,165.2   85,376.8
Logical reads:     14,955.9    11,805.0
Block changes:     667.4       526.8
Physical reads:    13,438.9    10,607.5
Physical writes:   46.4        36.6
User calls:        6.4         5.1
Parses:            6.2         4.9
Hard parses:       0.1         0.1
W/A MB processed:  0.3         0.2
Logons:            0.4         0.3
Executes:          12.2        9.7
Rollbacks:         0.0         0.0
Transactions:      1.3

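This section shows 13,438.9 physical reads per second against only 46.4 physical writes per second. Read pressure is far greater than write pressure, so the background sqlldr loading puts much less load on the system than the province-to-ministry interface does when it reads data.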

Shared Pool Statistics      Begin  End
Memory Usage %:             48.19  48.33
% SQL with executions>1:    96.71  96.76
% Memory for SQL w/exec>1:  91.31  89.65

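The amount of memory-hungry SQL suddenly increased. Going through the top 5 by hash_value / sql_id, the statement turned out to belong to the province-to-ministry interface's ps_xxxxlogp2p business.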

select xxxxx

FROM (
SELECT /*+ no_index(t) */ TRUNC(STxME, 'hh24') TIME,SxAN,RxC,RxC,CxL,rxype,t.ax

,SUM(xxxs) Appxmes,SUM(Txxic) Traffic
FROM PS_XXX t, cfg_apn apn
WHERE T.STxIME >= TO_DATE('20120813 19:00:00', 'yyyymmdd hh24:mi:ss')
AND T.STARxME < TO_DATE('20120813 20:00:00', 'yyyymmdd hh24:mi:ss')
AND upper(t.xn)=upper(apn.apn)
GROUP BY TRUNC(STxIME, 'hh24'),Sx_AN,RxC,RxC,xL,raxpe,t.axn
) t ,CFG_DEVICEMAP SGSNxM,CFG_DEVxP RNxDM,CFG_DEVxAP RACDM,MV_DxALUES d
WHERE
t.SxN = SxN_AxxLUE(+)
.....

(remainder omitted)

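This SQL has no apptype filter, which on its own is already unacceptable. More importantly, look at its WHERE condition: it filters on starttime. That is exactly the problem. The table is partitioned on endtime, so the filter cannot restrict the query to a few partitions; the query never finishes and memory is exhausted.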
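To illustrate why the choice of filter column matters, here is a toy model of range-partition pruning. It is not how the Oracle optimizer is implemented; the hourly partition layout and both function names are assumptions made purely for the example. Only the column names starttime and endtime and the 19:00 to 20:00 window come from the case above.

```python
from datetime import datetime, timedelta


def hourly_partitions(day_start: datetime, hours: int):
    """Hypothetical layout: one range partition per hour of the partition key."""
    return [
        (day_start + timedelta(hours=h), day_start + timedelta(hours=h + 1))
        for h in range(hours)
    ]


def partitions_to_scan(partitions, partition_key, predicate_column, lo, hi):
    """Return the partitions a query with lo <= predicate_column < hi must read."""
    if predicate_column != partition_key:
        # The filter says nothing about the partition key: every partition is read.
        return list(partitions)
    # Keep only partitions whose [p_lo, p_hi) range overlaps [lo, hi).
    return [(p_lo, p_hi) for p_lo, p_hi in partitions if p_lo < hi and p_hi > lo]


parts = hourly_partitions(datetime(2012, 8, 13), 24)
lo, hi = datetime(2012, 8, 13, 19), datetime(2012, 8, 13, 20)

on_starttime = partitions_to_scan(parts, "endtime", "starttime", lo, hi)
on_endtime = partitions_to_scan(parts, "endtime", "endtime", lo, hi)
print(len(on_starttime), "partitions scanned when filtering on starttime")
print(len(on_endtime), "partition(s) scanned when filtering on endtime")
```

In this toy layout a one-hour query touches a full day of partitions when it filters on the wrong column; on the real table the gap would be correspondingly larger.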

Next, the top wait events:

Top 5 Timed Foreground Events

Event                    Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
direct path read         1,128,304  117,891  104            78.01      User I/O
db file scattered read   119,308    23,548   197            15.58      User I/O
DB CPU                              4,068                   2.69
db file sequential read  15,750     2,072    132            1.37       User I/O
direct path read temp    6,439      1,134    176            0.75       User I/O

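This confirms my suspicion even further: four of the top five timed events are reads. On a healthy system, CPU time (DB CPU) should be at the top of the list.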


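There is no need to look at the rest for now; the above already shows a serious problem. Even if the storage itself is also at fault, what this analysis found accounts for more than 70% of the problem.

We will start on it tomorrow: tune the province-to-ministry interface, limit the number of sessions at peak time, and monitor memory.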




This morning I checked the Xinjiang CS again. The operating system error log reads as follows:

ress: w5001438004c99e8d,1 is online Load balancing: round-robin

Aug 14 02:31:47 racdb1 scsi: [ID 243001 kern.info] /pci@1,700000/lpfc@0/fp@0,0 (fcp4):

Aug 14 02:31:47 racdb1 ndi_devi_online: failed for array-controller: target=11500 lun=0 ffffffff

Aug 14 02:31:48 racdb1 scsi: [ID 243001 kern.info] /pci@11,700000/lpfc@0/fp@0,0 (fcp3):

Aug 14 02:31:48 racdb1 ndi_devi_online: failed for array-controller: target=11500 lun=0 ffffffff

Aug 14 02:31:48 racdb1 scsi: [ID 243001 kern.info] /pci@1,700000/lpfc@0/fp@0,0 (fcp4):

Aug 14 02:31:48 racdb1 ndi_devi_online: failed for array-controller: target=11500 lun=0 ffffffff

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f3087000004000007c0000 (ssd88):

Aug 14 02:32:35 racdb1 Error for Command: write(10) Error Level: Retryable

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 78104 Error Block: 0

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: C70000040000

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f3087000004000007c0000 (ssd88):

Aug 14 02:32:35 racdb1 Error for Command: write(10) Error Level: Retryable

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 78104 Error Block: 0

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: C70000040000

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention

Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000840000 (ssd90):

Aug 14 02:34:56 racdb1 Error for Command: write(10) Error Level: Retryable

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 82424 Error Block: 0

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 480000040000

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000840000 (ssd90):

Aug 14 02:34:56 racdb1 Error for Command: write(10) Error Level: Retryable

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 82424 Error Block: 0

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 480000040000

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention

Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000800000 (ssd91):

Aug 14 06:05:29 racdb1 Error for Command: read(10) Error Level: Retryable

Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 83968 Error Block: 0

Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 080000040000

Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention

Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000800000 (ssd91):

Aug 14 06:05:30 racdb1 Error for Command: read(10) Error Level: Retryable

Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 83968 Error Block: 0

Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 080000040000

Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention

Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0




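I have seen this error on yangtingkun's blog as well. Lao Yang could only shrug and say:

“Obviously, it is precisely a fault in the operating system or the hardware that caused this error.”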


The application has now been adjusted. Let's see how things look this afternoon.


Follow-up... (to be written this afternoon)



Hi all;
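After talking things over with xxxxx last night, we adjusted the application and have been observing it since. The situation looks fairly promising, although there is still plenty left to tune.
Of the three points he raised, the first is fine, and our table partitioning and subpartitioning are definitely already in place. It is just the xxxxxx business.
Before tuning (frequent crashes):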

Top 5 Timed Foreground Events

Event                    Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
direct path read         1,128,304  117,891  104            78.01      User I/O
db file scattered read   119,308    23,548   197            15.58      User I/O
DB CPU                              4,068                   2.69
db file sequential read  15,750     2,072    132            1.37       User I/O
direct path read temp    6,439      1,134    176            0.75       User I/O

Top 5 after tuning:

Top 5 Timed Foreground Events

Event                          Waits  Time(s)  Avg wait (ms)  % DB time  Wait Class
DB CPU                                506                     70.60
direct path read               8,243  156      19             21.71      User I/O
db file scattered read         2,036  12       6              1.69       User I/O
SQL*Net break/reset to client  1,236  11       9              1.57       Application
db file sequential read        1,915  10       5              1.34       User I/O

             Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:  7058     17-Aug-12 17:00:26  246       1.2
End Snap:    7059     17-Aug-12 18:00:02  216       1.1
Elapsed:     59.59 (mins)
DB Time:     11.95 (mins)

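DB Time = DB CPU + foreground wait time + time spent waiting on the CPU run queue.

Each CPU had 59.59 * 60 = 3,575.4 s of processing time available in this interval. 11.95 / (59.59 * 48) ≈ 0.0042, so DB Time used about 0.4% of the total CPU capacity: the system is practically idle.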
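The same check can be written out as a small sketch, using only the figures from the post-tuning AWR header and the 48 CPUs mentioned earlier. The function names are illustrative, not part of any Oracle tooling.

```python
def cpu_seconds_available(elapsed_min: float) -> float:
    """Seconds of processing time one CPU can offer during the snapshot interval."""
    return elapsed_min * 60.0


def db_time_share(db_time_min: float, elapsed_min: float, cpu_count: int) -> float:
    """DB Time as a fraction of the total CPU capacity over the interval."""
    return db_time_min / (elapsed_min * cpu_count)


per_cpu = cpu_seconds_available(59.59)     # elapsed minutes, snaps 7058-7059
share = db_time_share(11.95, 59.59, 48)    # DB Time minutes, elapsed minutes, CPUs
print(f"{per_cpu:.1f} s available per CPU, DB Time share of capacity = {share:.4f}")
```

A share this small means that even if every second of DB Time were CPU time, the 48 CPUs would barely notice it.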


Pay particular attention to how far direct path read has dropped.
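To put numbers on that drop, the sketch below recomputes the average wait from the Waits and Time(s) columns of the two Top 5 tables and compares before with after. The helper name and dictionary layout are mine; the input figures are taken from the AWR reports.

```python
def avg_wait_ms(waits: int, time_s: float) -> float:
    """Average wait in milliseconds, as AWR derives it from Time(s) / Waits."""
    return time_s * 1000.0 / waits


# direct path read, before and after tuning (from the two Top 5 tables)
before = {"waits": 1_128_304, "time_s": 117_891}
after = {"waits": 8_243, "time_s": 156}

before_ms = avg_wait_ms(before["waits"], before["time_s"])
after_ms = avg_wait_ms(after["waits"], after["time_s"])

print(f"avg wait: {before_ms:.0f} ms -> {after_ms:.0f} ms")
print(f"wait count reduced by a factor of {before['waits'] / after['waits']:.0f}")
print(f"wait time reduced by a factor of {before['time_s'] / after['time_s']:.0f}")
```

Both the number of waits and the latency of each wait fell, which is what one would expect once the storage is no longer saturated.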

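Possible causes of high direct path read waits:

  1. Large numbers of on-disk sorts (order by, group by, union, distinct, rollup) that cannot be completed in the PGA and have to spill to the temp tablespace. Reading the sort results back from the temporary tablespace produces direct path reads.

  2. Large numbers of hash joins that use the temp tablespace to hold the hash area.

  3. Parallel execution of SQL statements.

  4. Full scans of large tables. In Oracle 11g the full table scan algorithm changed: based on the size of the table, the size of the buffer cache and similar information, Oracle decides whether to bypass the SGA and read the data directly from disk.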

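In an OLAP environment like our xx system:

1. group by, order by, union

2. Parallel execution of SQL statements

3. Full scans of large tables

All three are common here, which is why direct path read spikes at certain times. Because these full table scans bypass the buffer cache, every one of them is a physical read. The result is heavy I/O pressure and, eventually, the storage collapsing.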

Before tuning:

User I/O Time (s)  Executions  UIO per Exec (s)  %Total  Elapsed Time (s)  %CPU   %IO    SQL Id         SQL Module        SQL Text
27,684.05          0                             19.10   28,929.23         4.20   95.70  c2mfjtnub94s9  PL/SQL Developer  select /*+ parallel(dt, 8) */ ...
14,921.78          1           14,921.78         10.30   15,128.51         1.41   98.63  fyd2m5r2r0h1v  PL/SQL Developer  select /*+ parallel(dt, 8) */ ...
8,823.28           1           8,823.28          6.09    9,012.14          2.15   97.90  f8462t1hfuprf  PL/SQL Developer  ---??????? -------------------...
6,548.98           1           6,548.98          4.52    6,648.51          1.45   98.50  3nb6rxfbrhqdk  PL/SQL Developer  select /*+ parallel(dt, 8) */ ...
5,832.73           1           5,832.73          4.02    6,073.87          3.94   96.03  aa2usv1skn956  PL/SQL Developer  select /*+ parallel(dt, 8) */ ...
3,553.17           0                             2.45    3,616.40          1.74   98.25  a25hq3mq7gk8p  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,552.25           0                             2.45    3,616.62          1.79   98.22  aj0yb5wrudxha  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,552.22           0                             2.45    3,616.53          1.79   98.22  fs1uud29zrqwp  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,552.07           0                             2.45    3,616.48          1.77   98.22  0ty0c58vgdgz5  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,551.84           0                             2.45    3,616.49          1.80   98.21  gwhagwnqp38gy  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,551.49           0                             2.45    3,616.45          1.78   98.20  f09j416qpvhms  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,551.47           0                             2.45    3,616.61          1.78   98.20  bsztwnq8puqm1  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,551.04           0                             2.45    3,616.56          1.76   98.19  cdjyqhm9tn0ak  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,551.01           0                             2.45    3,616.55          1.82   98.19  gz4w2u8k1zjgx  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,550.82           0                             2.45    3,616.54          1.81   98.18  6x4vgyfz9z011  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,550.33           0                             2.45    3,616.38          1.85   98.17  17jggncfp816r  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,550.27           0                             2.45    3,616.41          1.82   98.17  70gz8cajpf3s4  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,550.18           0                             2.45    3,616.49          1.84   98.17  568h19gn1v942  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,549.66           0                             2.45    3,616.40          1.79   98.15  b58g6ur474c2y  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,548.89           0                             2.45    3,616.43          1.84   98.13  dnb58gxnxh6ru  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,548.60           0                             2.45    3,616.23          1.85   98.13  1g58wy4nprp6c  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,548.23           0                             2.45    3,616.35          1.82   98.12  3y7zh3ag4wh5w  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,547.75           0                             2.45    3,616.49          1.87   98.10  0ay5juvzsqqvu  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,547.43           0                             2.45    3,616.49          1.85   98.09  f9vaxg3kd6sc8  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
3,546.20           0                             2.45    3,616.38          1.87   98.06  dmxgtdjqf88qa  JDBC Thin Client  SELECT to_char(TIME, 'yyyy-mm-...
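The top few entries are probably xxx staff running tests, or someone putting the database's performance to the test. After them come the familiar calls from the Java program through the JDBC Thin Client (xxxxxx). Those parallel queries are what caused the direct path read waits, and the direct path read temp waits most likely come from them too, because the SQL contains a lot of grouping. Explicit sorting, on the other hand, is not present. I specifically captured a few of the statements: they include the network-optimization PS queries on the 3G packet network, and they use UNION operations, which amount to a sort.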


After tuning:

User I/O Time (s)  Executions  UIO per Exec (s)  %Total  Elapsed Time (s)  %CPU   %IO    SQL Id         SQL Module        SQL Text
77.77              1           77.77             43.38   116.51            33.41  66.75  28yaqv6bnb6xc  PL/SQL Developer  select /*+ parallel(dt, 8) */ ...
72.03              1           72.03             40.18   111.66            34.93  64.51  0zq83gnkbgyrp  PL/SQL Developer  select /*+ parallel(dt, 8) */ ...
10.84              2           5.42              6.05    317.02            95.90  3.42   ddgcvawy1q8ba  [3600:?????????]  SELECT /*+ parallel(8) index(t...
10.01              1           10.01             5.58    18.57             44.05  53.88  5r3r7thkjsx94  PL/SQL Developer  select trunc(starttime, 'hh24'...
5.68               3           1.89              3.17    6.39              8.29   88.94  9k75s8j10g6v7                    SELECT /* OPT_DYN_SAMP */ /*+ ...
5.45               1           5.45              3.04    6.50              7.85   83.97  6q53sucwnb9sr  i-Signal.exe      select /*+ parallel(4) first_r...
5.13               2           2.57              2.86    5.88              5.61   87.23  130dvvr5s8bgn                    select obj#, dataobj#, part#, ...
3.80               1           3.80              2.12    4.28              8.89   88.82  48vbtndvcbhju                    SELECT /* OPT_DYN_SAMP */ /*+ ...
2.11               1           2.11              1.18    2.17              2.31   97.22  350myuyx0t1d6                    insert into wrh$_tablespace_st...
0.22               1           0.22              0.13    0.73              45.05  30.66  6ajkhukk78nsr                    begin prvt_hdm.auto_execute( :...

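Part of the problem is now solved. If there is time, we can go on to study the rest (wait events and so on). It is getting late, so we will continue tomorrow.

Note: the province-to-ministry interface program can still be tuned further, in particular the degree of parallelism in its parallel hints. The application should also be scheduled so that it does not run at the same time as other busy-hour jobs.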

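I also checked the memory. The Xinjiang network-optimization host (Sun OS) has 128 GB of RAM in total, and our database is configured with sga_max_size=76G and sga_target=28G. That is actually a bit tight. sga_max_size is usually set to about 60%, but that 60% should not be taken of the full 128 GB (128 * 0.6 = 76.8 GB). It should be 60% of what remains after setting aside the roughly 20% used by the OS: (128 - 128 * 0.2) * 0.6 = 61.4 GB.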
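The sizing rule can be written down as a tiny calculation. The 20% OS reserve and the 60% SGA share are rules of thumb rather than Oracle requirements, and the function name and parameter names are made up for this sketch.

```python
def suggested_sga_max_gb(total_ram_gb: float,
                         os_reserve: float = 0.20,
                         sga_fraction: float = 0.60) -> float:
    """Rule of thumb: take the SGA share from what is left after the OS reserve."""
    usable_gb = total_ram_gb - total_ram_gb * os_reserve
    return usable_gb * sga_fraction


TOTAL_RAM_GB = 128  # physical memory of the host

naive = TOTAL_RAM_GB * 0.60                    # 60% of all RAM
adjusted = suggested_sga_max_gb(TOTAL_RAM_GB)  # 60% of RAM minus the OS reserve

print(f"naive: {naive:.1f} GB, adjusted: {adjusted:.1f} GB")
```

Both fractions are parameters, so the same function can be reused if a host needs a larger OS reserve or a different SGA share.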

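As for sga_target: my understanding of 11g was that once it is set and the instance is restarted, it should end up equal to sga_max_size; the theory I had in mind was that when SGA_TARGET < SGA_MAX_SIZE, SGA_MAX_SIZE becomes the size of SGA_TARGET after a restart. What I actually see on the Xinjiang system, though, is sga_target=28G alongside sga_max_size=76G. As far as I can tell from the Oracle documentation, SGA_MAX_SIZE is only raised automatically when it is unset or smaller than SGA_TARGET at startup, and an explicitly larger value is kept, which would explain the observed settings.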

SGA

* Buffer cache (DB_CACHE_SIZE)

* Shared pool (SHARED_POOL_SIZE)

* Large pool (LARGE_POOL_SIZE)

* Java pool (JAVA_POOL_SIZE)

* Streams pool (STREAMS_POOL_SIZE)

