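For privacy reasons, some parts are replaced with xxx.
The database for xxx province keeps going down. From the system log it looks as if ASM lost its access permissions on the disks. That is true: after the crash and the automatic restart the permissions were indeed gone, something we also ran into when the database was built. But why does it always go down between 6 and 7 o'clock? I analyzed the AWR report from before the crash and the AWR report taken after the manual recovery.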
Before the crash:
| | Snap Id | Snap Time | Sessions | Cursors/Session |
|---|---|---|---|---|
| Begin Snap: | 7033 | 16-Aug-12 16:00:43 | 175 | 1.3 |
| End Snap: | 7034 | 16-Aug-12 17:00:20 | 180 | 1.2 |
| Elapsed: | | 59.63 (mins) | | |
| DB Time: | | 1,425.64 (mins) | | |


| | Snap Id | Snap Time | Sessions | Cursors/Session |
|---|---|---|---|---|
| Begin Snap: | 7034 | 16-Aug-12 17:00:20 | 180 | 1.2 |
| End Snap: | 7035 | 16-Aug-12 18:00:12 | 261 | 1.3 |
| Elapsed: | | 59.86 (mins) | | |
| DB Time: | | 2,518.78 (mins) | | |

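DB Time does not include the time consumed by Oracle background processes. If DB Time is far smaller than Elapsed, the database is fairly idle. In our Xinjiang database, however, DB Time rose from 1,425.64 minutes in the 16:00 to 17:00 interval to 2,518.78 minutes in the 17:00 to 18:00 interval, while Elapsed was only 59.86 minutes.
In those 59.86 minutes the database accumulated 2,518.78 minutes of DB Time, which is about 42 sessions active at any moment (2,518.78 / 59.86 ≈ 42.1, the same figure as DB Time(s) per second in the Load Profile). The RDA data shows 48 CPUs (24 cores), so the active-session count is roughly 88% of the CPU count (42.1 / 48), although most of that time turns out to be I/O wait rather than CPU. The system is under very heavy pressure. Sessions rose from 180 to 261. At that time the province-to-ministry interface was also collecting data, and one of its large tables receives at least 60 to 80 million rows per hour.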
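The arithmetic behind that estimate can be checked with a short sketch. It only uses the DB Time, Elapsed and CPU figures quoted above; the function name `average_active_sessions` is made up for illustration and is not an Oracle API.

```python
def average_active_sessions(db_time_min, elapsed_min):
    """DB Time divided by wall-clock time: sessions active on average."""
    return db_time_min / elapsed_min


CPU_COUNT = 48  # CPU count taken from the RDA data mentioned in the text

# Snapshots 7033 -> 7034 (16:00 to 17:00) and 7034 -> 7035 (17:00 to 18:00)
aas_before = average_active_sessions(1425.64, 59.63)
aas_peak = average_active_sessions(2518.78, 59.86)

# Active sessions relative to the CPU count. This is only a rough pressure
# indicator: sessions waiting on I/O count as active but use no CPU.
aas_per_cpu = aas_peak / CPU_COUNT

print(f"16:00-17:00 average active sessions: {aas_before:.1f}")
print(f"17:00-18:00 average active sessions: {aas_peak:.1f}")
print(f"17:00-18:00 active sessions per CPU: {aas_per_cpu:.2f}")
```

A ratio near or above 1.0 would mean more active sessions than CPUs; the wait-event breakdown is still needed to tell CPU saturation from I/O stalls.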
Load Profile
| | Per Second | Per Transaction | Per Exec | Per Call |
|---|---|---|---|---|
| DB Time(s): | 42.1 | 33.2 | 3.44 | 6.57 |
| DB CPU(s): | 1.1 | 0.9 | 0.09 | 0.18 |
| Redo size: | 108,165.2 | 85,376.8 | | |
| Logical reads: | 14,955.9 | 11,805.0 | | |
| Block changes: | 667.4 | 526.8 | | |
| Physical reads: | 13,438.9 | 10,607.5 | | |
| Physical writes: | 46.4 | 36.6 | | |
| User calls: | 6.4 | 5.1 | | |
| Parses: | 6.2 | 4.9 | | |
| Hard parses: | 0.1 | 0.1 | | |
| W/A MB processed: | 0.3 | 0.2 | | |
| Logons: | 0.4 | 0.3 | | |
| Executes: | 12.2 | 9.7 | | |
| Rollbacks: | 0.0 | 0.0 | | |
| Transactions: | 1.3 | | | |

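This section shows 13,438.9 physical reads per second against only 46.4 physical writes per second. Read pressure is far greater than write pressure, so the load from the background sqlldr jobs that insert data is far smaller than the load from the province-to-ministry interface reading data.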
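To put numbers on "far greater", here is a minimal sketch using the per-second Load Profile values. The variable names are mine, and the second ratio is only a rough indicator of how little of the read load is served from cache.

```python
# Per-second values copied from the Load Profile
physical_reads_per_s = 13_438.9
physical_writes_per_s = 46.4
logical_reads_per_s = 14_955.9

# How many blocks are read from disk for every block written
read_write_ratio = physical_reads_per_s / physical_writes_per_s

# Share of logical reads that had to go to disk
physical_share_of_logical = physical_reads_per_s / logical_reads_per_s

print(f"physical reads per physical write: {read_write_ratio:.0f}")
print(f"physical reads / logical reads:    {physical_share_of_logical:.1%}")
```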
Shared Pool Statistics
| | Begin | End |
|---|---|---|
| Memory Usage %: | 48.19 | 48.33 |
| % SQL with executions>1: | 96.71 | 96.76 |
| % Memory for SQL w/exec>1: | 91.31 | 89.65 |

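The number of memory-consuming SQL statements suddenly increased. From the top 5 SQL I traced the sql_id (hash_value) to the P2P business on ps_xxxxlog in the province-to-ministry interface.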
select xxxxx
FROM (
SELECT /*+ no_index(t) */ TRUNC(STxME, 'hh24') TIME,SxAN,RxC,RxC,CxL,rxype,t.ax
,SUM(xxxs) Appxmes,SUM(Txxic) Traffic
FROM PS_XXX t, cfg_apn apn
WHERE T.STxIME >= TO_DATE('20120813 19:00:00', 'yyyymmdd hh24:mi:ss')
AND T.STARxME < TO_DATE('20120813 20:00:00', 'yyyymmdd hh24:mi:ss')
AND upper(t.xn)=upper(apn.apn)
GROUP BY TRUNC(STxIME, 'hh24'),Sx_AN,RxC,RxC,xL,raxpe,t.axn
) t ,CFG_DEVICEMAP SGSNxM,CFG_DEVxP RNxDM,CFG_DEVxAP RACDM,MV_DxALUES d
WHERE
t.SxN = SxN_AxxLUE(+)
.....
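(rest omitted)
Leaving aside that this SQL has no apptype condition, which is already unacceptable, look at its WHERE clause: it filters on starttime, and that is the problem. The table is partitioned on endtime, so the filter cannot prune partitions; the query never finishes and exhausts memory.
Next, look at the wait events: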
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
direct path read | 1,128,304 | 117,891 | 104 | 78.01 | User I/O |
db file scattered read | 119,308 | 23,548 | 197 | 15.58 | User I/O |
DB CPU | | 4,068 | | 2.69 | |
db file sequential read | 15,750 | 2,072 | 132 | 1.37 | User I/O |
direct path read temp | 6,439 | 1,134 | 176 | 0.75 | User I/O |

This further confirms my guess: of the five most serious waits, four are reads. When a system is running well, CPU time should rank first.
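As a sanity check on the Top 5 figures, the sketch below recomputes the average wait per event and adds up the share of DB time spent in the User I/O wait class. The tuple layout and the helper `avg_wait_ms` are illustrative only; the numbers are copied from the report.

```python
# (event, waits, time_s, pct_db_time, wait_class) copied from the Top 5 table.
# DB CPU has no wait count or wait class, so those fields are None.
TOP5 = [
    ("direct path read", 1_128_304, 117_891, 78.01, "User I/O"),
    ("db file scattered read", 119_308, 23_548, 15.58, "User I/O"),
    ("DB CPU", None, 4_068, 2.69, None),
    ("db file sequential read", 15_750, 2_072, 1.37, "User I/O"),
    ("direct path read temp", 6_439, 1_134, 0.75, "User I/O"),
]


def avg_wait_ms(waits, time_s):
    """Average wait in milliseconds: total wait time divided by wait count."""
    return time_s / waits * 1000.0


# Total share of DB time attributed to the User I/O wait class
user_io_pct = sum(pct for _, _, _, pct, cls in TOP5 if cls == "User I/O")

# Average wait per event, skipping DB CPU (it is not a wait event)
avg_waits = {
    name: avg_wait_ms(waits, time_s)
    for name, waits, time_s, _, _ in TOP5
    if waits is not None
}

print(f"User I/O share of DB time: {user_io_pct:.2f}%")
for name, ms in avg_waits.items():
    print(f"{name}: {ms:.0f} ms per wait")
```

The recomputed averages agree with the Avg wait (ms) column, and every read event averages well over 100 ms per wait.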
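There is no need to look at the rest for now; the above already shows a serious problem. Even if the storage itself is also faulty, the issues analyzed above account for more than 70% of the problem.
Start working on it tomorrow: adjust the province-to-ministry interface, limit the number of sessions at peak time, and monitor memory.
This morning I checked the Xinjiang CS again. The operating system error log reads as follows: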
ress: w5001438004c99e8d,1 is online Load balancing: round-robin
Aug 14 02:31:47 racdb1 scsi: [ID 243001 kern.info] /pci@1,700000/lpfc@0/fp@0,0 (fcp4):
Aug 14 02:31:47 racdb1 ndi_devi_online: failed for array-controller: target=11500 lun=0 ffffffff
Aug 14 02:31:48 racdb1 scsi: [ID 243001 kern.info] /pci@11,700000/lpfc@0/fp@0,0 (fcp3):
Aug 14 02:31:48 racdb1 ndi_devi_online: failed for array-controller: target=11500 lun=0 ffffffff
Aug 14 02:31:48 racdb1 scsi: [ID 243001 kern.info] /pci@1,700000/lpfc@0/fp@0,0 (fcp4):
Aug 14 02:31:48 racdb1 ndi_devi_online: failed for array-controller: target=11500 lun=0 ffffffff
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f3087000004000007c0000 (ssd88):
Aug 14 02:32:35 racdb1 Error for Command: write(10) Error Level: Retryable
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 78104 Error Block: 0
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: C70000040000
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f3087000004000007c0000 (ssd88):
Aug 14 02:32:35 racdb1 Error for Command: write(10) Error Level: Retryable
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 78104 Error Block: 0
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: C70000040000
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 14 02:32:35 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000840000 (ssd90):
Aug 14 02:34:56 racdb1 Error for Command: write(10) Error Level: Retryable
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 82424 Error Block: 0
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 480000040000
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000840000 (ssd90):
Aug 14 02:34:56 racdb1 Error for Command: write(10) Error Level: Retryable
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 82424 Error Block: 0
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 480000040000
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 14 02:34:56 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000800000 (ssd91):
Aug 14 06:05:29 racdb1 Error for Command: read(10) Error Level: Retryable
Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 83968 Error Block: 0
Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 080000040000
Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 14 06:05:29 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g6001438007f308700000400000800000 (ssd91):
Aug 14 06:05:30 racdb1 Error for Command: read(10) Error Level: Retryable
Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] Requested Block: 83968 Error Block: 0
Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] Vendor: HP Serial Number: 080000040000
Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Aug 14 06:05:30 racdb1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
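I have also seen this error on yangtingkun's blog. Old Yang was at a loss for words and simply remarked:
"Obviously, it is a fault in the operating system or the hardware that caused this error."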
The application has now been adjusted; let's see how things go this afternoon.
To be continued ... (more this afternoon)
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
direct path read | 1,128,304 | 117,891 | 104 | 78.01 | User I/O |
db file scattered read | 119,308 | 23,548 | 197 | 15.58 | User I/O |
DB CPU | | 4,068 | | 2.69 | |
db file sequential read | 15,750 | 2,072 | 132 | 1.37 | User I/O |
direct path read temp | 6,439 | 1,134 | 176 | 0.75 | User I/O |

Top 5 after the adjustment:
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
DB CPU | | 506 | | 70.60 | |
direct path read | 8,243 | 156 | 19 | 21.71 | User I/O |
db file scattered read | 2,036 | 12 | 6 | 1.69 | User I/O |
SQL*Net break/reset to client | 1,236 | 11 | 9 | 1.57 | Application |
db file sequential read | 1,915 | 10 | 5 | 1.34 | User I/O |

| | Snap Id | Snap Time | Sessions | Cursors/Session |
|---|---|---|---|---|
| Begin Snap: | 7058 | 17-Aug-12 17:00:26 | 246 | 1.2 |
| End Snap: | 7059 | 17-Aug-12 18:00:02 | 216 | 1.1 |
| Elapsed: | | 59.59 (mins) | | |
| DB Time: | | 11.95 (mins) | | |

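DB Time = DB CPU + foreground wait time + time spent on the CPU run queue.
Each CPU has 59.59 × 60 = 3,575.4 s of processing time available in this interval. 11.95 / (59.59 × 48) ≈ 0.0042, so the load is practically nonexistent and the system is idle.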
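The same calculation as a small sketch, with the pre-adjustment interval (DB Time 2,518.78 min over 59.86 min) alongside for comparison. The helper names are hypothetical; the inputs are the Elapsed, DB Time and CPU figures from the two reports.

```python
CPU_COUNT = 48  # from the RDA data


def cpu_seconds_per_cpu(elapsed_min):
    """Wall-clock seconds each CPU can offer during the snapshot interval."""
    return elapsed_min * 60.0


def db_time_fraction(db_time_min, elapsed_min, cpus=CPU_COUNT):
    """DB Time as a fraction of the total CPU time available in the interval."""
    return db_time_min / (elapsed_min * cpus)


available_s = cpu_seconds_per_cpu(59.59)      # after the adjustment
after = db_time_fraction(11.95, 59.59)        # snapshots 7058 -> 7059
before = db_time_fraction(2518.78, 59.86)     # snapshots 7034 -> 7035
reduction = 2518.78 / 11.95                   # how much DB Time shrank

print(f"available per CPU: {available_s:.1f} s")
print(f"before: {before:.4f}  after: {after:.4f}")
print(f"DB Time reduced by a factor of about {reduction:.0f}")
```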
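Possible causes of high direct path read:
1. Heavy disk sorting: order by, group by, union, distinct and rollup operations that cannot complete in the PGA and have to sort in the temp tablespace. Reading the sort results back from the temp tablespace produces direct path reads (reported as direct path read temp in recent versions).
2. Heavy hash join activity that uses the temp tablespace to hold the hash area.
3. Parallel execution of SQL statements.
4. Full scans of large tables. In Oracle 11g the full-table-scan algorithm changed: based on the table size, the buffer cache size and other information, Oracle decides whether to bypass the SGA and read the data directly from disk.
In an OLAP environment like our xx:
1. group by, order by, union
2. Parallel execution of SQL statements
3. Full scans of large tables
These three are seen all the time, which is why direct path read becomes abnormally high at certain moments. Because the buffer cache is bypassed, such full table scans are all physical reads. The result is heavy I/O pressure, and the storage collapses.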
Before the adjustment:
User I/O Time (s) | Executions | UIO per Exec (s) | %Total | Elapsed Time (s) | %CPU | %IO | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|---|---|---|---|---|
27,684.05 | 0 | | 19.10 | 28,929.23 | 4.20 | 95.70 | c2mfjtnub94s9 | PL/SQL Developer | select /*+ parallel(dt, 8) */ ... |
14,921.78 | 1 | 14,921.78 | 10.30 | 15,128.51 | 1.41 | 98.63 | fyd2m5r2r0h1v | PL/SQL Developer | select /*+ parallel(dt, 8) */ ... |
8,823.28 | 1 | 8,823.28 | 6.09 | 9,012.14 | 2.15 | 97.90 | f8462t1hfuprf | PL/SQL Developer | ---??????? -------------------... |
6,548.98 | 1 | 6,548.98 | 4.52 | 6,648.51 | 1.45 | 98.50 | 3nb6rxfbrhqdk | PL/SQL Developer | select /*+ parallel(dt, 8) */ ... |
5,832.73 | 1 | 5,832.73 | 4.02 | 6,073.87 | 3.94 | 96.03 | aa2usv1skn956 | PL/SQL Developer | select /*+ parallel(dt, 8) */ ... |
3,553.17 | 0 | | 2.45 | 3,616.40 | 1.74 | 98.25 | a25hq3mq7gk8p | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,552.25 | 0 | | 2.45 | 3,616.62 | 1.79 | 98.22 | aj0yb5wrudxha | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,552.22 | 0 | | 2.45 | 3,616.53 | 1.79 | 98.22 | fs1uud29zrqwp | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,552.07 | 0 | | 2.45 | 3,616.48 | 1.77 | 98.22 | 0ty0c58vgdgz5 | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,551.84 | 0 | | 2.45 | 3,616.49 | 1.80 | 98.21 | gwhagwnqp38gy | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,551.49 | 0 | | 2.45 | 3,616.45 | 1.78 | 98.20 | f09j416qpvhms | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,551.47 | 0 | | 2.45 | 3,616.61 | 1.78 | 98.20 | bsztwnq8puqm1 | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,551.04 | 0 | | 2.45 | 3,616.56 | 1.76 | 98.19 | cdjyqhm9tn0ak | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,551.01 | 0 | | 2.45 | 3,616.55 | 1.82 | 98.19 | gz4w2u8k1zjgx | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,550.82 | 0 | | 2.45 | 3,616.54 | 1.81 | 98.18 | 6x4vgyfz9z011 | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,550.33 | 0 | | 2.45 | 3,616.38 | 1.85 | 98.17 | 17jggncfp816r | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,550.27 | 0 | | 2.45 | 3,616.41 | 1.82 | 98.17 | 70gz8cajpf3s4 | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,550.18 | 0 | | 2.45 | 3,616.49 | 1.84 | 98.17 | 568h19gn1v942 | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,549.66 | 0 | | 2.45 | 3,616.40 | 1.79 | 98.15 | b58g6ur474c2y | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,548.89 | 0 | | 2.45 | 3,616.43 | 1.84 | 98.13 | dnb58gxnxh6ru | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,548.60 | 0 | | 2.45 | 3,616.23 | 1.85 | 98.13 | 1g58wy4nprp6c | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,548.23 | 0 | | 2.45 | 3,616.35 | 1.82 | 98.12 | 3y7zh3ag4wh5w | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,547.75 | 0 | | 2.45 | 3,616.49 | 1.87 | 98.10 | 0ay5juvzsqqvu | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,547.43 | 0 | | 2.45 | 3,616.49 | 1.85 | 98.09 | f9vaxg3kd6sc8 | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |
3,546.20 | 0 | | 2.45 | 3,616.38 | 1.87 | 98.06 | dmxgtdjqf88qa | JDBC Thin Client | SELECT to_char(TIME, 'yyyy-mm-... |

After the adjustment:
User I/O Time (s) | Executions | UIO per Exec (s) | %Total | Elapsed Time (s) | %CPU | %IO | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|---|---|---|---|---|
77.77 | 1 | 77.77 | 43.38 | 116.51 | 33.41 | 66.75 | 28yaqv6bnb6xc | PL/SQL Developer | select /*+ parallel(dt, 8) */ ... |
72.03 | 1 | 72.03 | 40.18 | 111.66 | 34.93 | 64.51 | 0zq83gnkbgyrp | PL/SQL Developer | select /*+ parallel(dt, 8) */ ... |
10.84 | 2 | 5.42 | 6.05 | 317.02 | 95.90 | 3.42 | ddgcvawy1q8ba | [3600:?????????] | SELECT /*+ parallel(8) index(t... |
10.01 | 1 | 10.01 | 5.58 | 18.57 | 44.05 | 53.88 | 5r3r7thkjsx94 | PL/SQL Developer | select trunc(starttime, 'hh24'... |
5.68 | 3 | 1.89 | 3.17 | 6.39 | 8.29 | 88.94 | 9k75s8j10g6v7 | | SELECT /* OPT_DYN_SAMP */ /*+ ... |
5.45 | 1 | 5.45 | 3.04 | 6.50 | 7.85 | 83.97 | 6q53sucwnb9sr | i-Signal.exe | select /*+ parallel(4) first_r... |
5.13 | 2 | 2.57 | 2.86 | 5.88 | 5.61 | 87.23 | 130dvvr5s8bgn | | select obj#, dataobj#, part#, ... |
3.80 | 1 | 3.80 | 2.12 | 4.28 | 8.89 | 88.82 | 48vbtndvcbhju | | SELECT /* OPT_DYN_SAMP */ /*+ ... |
2.11 | 1 | 2.11 | 1.18 | 2.17 | 2.31 | 97.22 | 350myuyx0t1d6 | | insert into wrh$_tablespace_st... |
0.22 | 1 | 0.22 | 0.13 | 0.73 | 45.05 | 30.66 | 6ajkhukk78nsr | | begin prvt_hdm.auto_execute( :... |

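Part of the problem is now solved. If there is time we can keep digging into the rest, such as the wait events, but it is getting late, so let's continue tomorrow.
Note: the province-to-ministry interface program can probably be tuned further, in particular the parallel degree. The application should also be scheduled so that it does not overlap with other applications at busy hours.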
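I also noticed that the Xinjiang network-optimization Sun OS host has 128G of memory in total, and our database is configured with sga_max_size=76G and sga_target=28G. That setting is actually a bit tight. sga_max_size is usually set to 60%, but that 60% is not 128 × 0.6 = 76.8G; it should be 60% of what remains after setting aside the 20% used by the OS: (128 - 128 × 0.2) × 0.6 = 61.4G.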
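The sizing rule above can be written as a tiny helper. This is only a sketch of the rule of thumb stated in this post (20% reserved for the OS, 60% of the remainder for the SGA), not an Oracle recommendation; the function name and the default parameters are assumptions of mine.

```python
def sga_ceiling_gb(total_ram_gb, os_reserve=0.20, sga_fraction=0.60):
    """Upper bound for sga_max_size: a fraction of RAM left after the OS share."""
    usable_gb = total_ram_gb * (1.0 - os_reserve)
    return usable_gb * sga_fraction


TOTAL_RAM_GB = 128  # physical memory of the host, from the text

naive_gb = TOTAL_RAM_GB * 0.60               # 60% of all RAM
adjusted_gb = sga_ceiling_gb(TOTAL_RAM_GB)   # 60% after the OS reserve

print(f"naive 60% of RAM:       {naive_gb:.1f} GB")
print(f"60% after 20% for OS:   {adjusted_gb:.1f} GB")
```

Keeping both percentages as parameters makes it easy to try other reserves for hosts that run more than the database.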
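As I understand the new behavior in 11g, once sga_target is set and the instance is restarted it should end up equal to sga_max_size. The theory is that when SGA_TARGET < SGA_MAX_SIZE, SGA_MAX_SIZE takes the size of SGA_TARGET after the instance restarts. Yet what I see on the Xinjiang instance is still sga_target=28G.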
SGA
* Buffer cache (DB_CACHE_SIZE)
* Shared pool (SHARED_POOL_SIZE)
* Large pool (LARGE_POOL_SIZE)
* Java pool (JAVA_POOL_SIZE)
* Streams pool (STREAMS_POOL_SIZE)