昨天一套核心库出现故障,几百个session堵塞,最后宕机,由于数据库是客服库,很敏感,所以办公室顿时嘈杂起来。
我们处理这类故障的原则是先恢复业务,其它的都随后在展开,在随后的分析工作中,走了点弯路,下面我把整个故障的情况展现给大家。
首先,在alert里面发现了ora-01555,这个是从7点左右开始的,一直持续到下午2点都有这样的事件.
Tue May 22 07:11:56 2012 ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=101 sec, SCN: 0x0b3c.37ef1c98) Tue May 22 08:45:34 2012 Errors in file /u01/oracle/app/oracle/admin/crmkf/bdump/crmkf1_smon_8004.trc: ORA-30036: unable to extend segment by 4 in undo tablespace 'UNDOTBS1'
直到八点多报出ORA-30036,在随后一直有ORA-01555,的确当时UNDO消耗殆尽,大家都走入了误区,
把数据库Hang都集中在undo相关的因素上,在这个基础上去找原因,但是最后也没啥进展。
看看如下的统计,这是在事故发生的前几分钟的信息:
ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=101 sec, SCN: 0x0b3c.37ef1c98): ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=107 sec, SCN: 0x0b3c.381970a9): ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=152 sec, SCN: 0x0b3c.380ff215): ORA-01555 caused by SQL statement below (SQL ID: 18hwx0w69f9pg, Query Duration=73 sec, SCN: 0x0b3c.3822c0ef): ORA-01555 caused by SQL statement below (SQL ID: dx2gpf5jk1jnu, Query Duration=113 sec, SCN: 0x0b3c.38e2030f): ORA-01555 caused by SQL statement below (SQL ID: f1v5vww09yzzd, Query Duration=170 sec, SCN: 0x0b3c.3a53f9b0): ORA-01555 caused by SQL statement below (SQL ID: 948xj46pvh0j4, Query Duration=28 sec, SCN: 0x0b3c.3a84c77d): ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=186 sec, SCN: 0x0b3c.3a7d6206): ORA-01555 caused by SQL statement below (SQL ID: gd31z0zdh33nj, Query Duration=226 sec, SCN: 0x0b3c.3a76b4ab): ORA-01555 caused by SQL statement below (SQL ID: dx2gpf5jk1jnu, Query Duration=29 sec, SCN: 0x0b3c.3b871228): ORA-01555 caused by SQL statement below (SQL ID: 02pnhfvmwua8r, Query Duration=37 sec, SCN: 0x0b3c.3be2e941): ORA-01555 caused by SQL statement below (SQL ID: 05m5p2na5f8yd, Query Duration=6 sec, SCN: 0x0b3c.3c17f5fa): ORA-01555 caused by SQL statement below (SQL ID: 1xj2gpb5x1a38, Query Duration=6 sec, SCN: 0x0b3c.3c24d864): ORA-01555 caused by SQL statement below (SQL ID: 4qcd9jnyt4j27, Query Duration=66 sec, SCN: 0x0b3c.3c4c2d04): ORA-01555 caused by SQL statement below (SQL ID: 2xp5nc6rmd42t, Query Duration=15 sec, SCN: 0x0b3c.3c63797d): ORA-01555 caused by SQL statement below (SQL ID: 8r4xzv0amrkyb, Query Duration=64 sec, SCN: 0x0b3c.3c5c7c15): ORA-01555 caused by SQL statement below (SQL ID: 8v307x8qzc78y, Query Duration=12 sec, SCN: 0x0b3c.3d084c98):
很短小的事务也失败,这个有点不可理解,duration=6的也报出ora-01555。
再来看看当时数据库的等待事件:
col samp_time for a25 col event for a30 select session_id sid, to_char(SAMPLE_TIME,'YY-MM-DD:HH:MI:SS') samp_time, session_state, event, p1, p2, p3, wait_time, time_waited from DBA_HIST_ACTIVE_SESS_HISTORY where to_char(SAMPLE_TIME,'YY-MM-DD:HH24:MI:SS') between '12-05-22:14:45:00' and '12-05-22:14:55:00' --and event like'cursor%' and p1=2057858921 order by sample_id desc; SID SAMP_TIME SESSION_STATE EVENT P1 P2 P3 TIME_WAITED ---------- ------------------------- ------------- ---------------------- ---------- ---------- ---------- ----------- 5608 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986328 5613 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986320 5615 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986324 5616 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986314 5619 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986327 5621 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986321 5622 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986311 5623 12-05-22:02:52:30 WAITING enq: FB - contention 1178730502 5 42451933 498021 5625 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986323 5627 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986321 5631 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986322 5633 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986323 5634 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986327 5637 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986308 5638 12-05-22:02:52:30 WAITING enq: FB - contention 1178730502 5 42451933 498022 5640 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986328 5641 12-05-22:02:52:30 WAITING log file sync 4294967295 0 0 986328 5642 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986329 5656 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986315 5658 12-05-22:02:52:30 WAITING log file sync 2734 0 0 986278
这里只摘录了一部分,看看等待的统计信息:
SQL> select event,count(*) from 2 dba_hist_active_sess_history 3 where to_char(SAMPLE_TIME,'YY-MM-DD:HH24:MI:SS') between 4 '12-05-22:14:45:00' and '12-05-22:14:55:00' 5 group by event 6 order by 2 desc; EVENT COUNT(*) --------------------------------- -------- log file sync 12370 enq: FB - contention 1018 buffer busy waits 497 db file sequential read 290 db file parallel write 231 direct path read 165 control file sequential read 92 enq: HW - contention 79 log file parallel write 27 control file parallel write 25 read by other session 13 gcs log flush sync 2 wait for scn ack 1
当时数据库基本都在等待log file sync,enq:FB-contention,尤其log file sync等待的时间是s级别。这个基本可以断定IO肯定有问题了,
于是查看系统log,有了实质性的发现,导致redo写入异常,从而引发数据库的不正常,废话少说了,看下面log:
May 22 14:51:14 zjddkf02 vmunix: LVM: WARNING: VG 64 0x090000: LV 3: Some I/O requests to this LV are waiting May 22 14:51:14 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 10: Some I/O requests to this LV are waiting May 22 14:51:18 zjddkf02 vmunix: LVM: WARNING: VG 64 0x090000: LV 1: Some I/O requests to this LV are waiting May 22 14:51:19 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 12: Some I/O requests to this LV are waiting May 22 14:51:26 zjddkf02 vmunix: LVM: WARNING: VG 64 0x110000: LV 18: Some I/O requests to this LV are waiting May 22 14:51:26 zjddkf02 vmunix: LVM: VG 64 0x170000: PVLink 1 0x000078 Failed! The PV is not accessible. May 22 14:51:29 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000035 Failed! The PV is not accessible. May 22 14:51:26 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until May 22 14:51:38 zjddkf02 above message repeats 25 times May 22 14:51:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 2: Some I/O requests to this LV are waiting May 22 14:51:26 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV). May 22 14:51:38 zjddkf02 above message repeats 25 times May 22 14:51:38 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until May 22 14:51:38 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV). May 22 14:51:39 zjddkf02 vmunix: LVM: VG 64 0x090000: PVLink 1 0x000054 Failed! The PV is not accessible. May 22 14:51:41 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 14: Some I/O requests to this LV are waiting May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000037 Failed! The PV is not accessible. May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000038 Failed! The PV is not accessible. May 22 14:51:41 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x00003c Failed! The PV is not accessible. May 22 14:51:41 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV). May 22 14:51:47 zjddkf02 vmunix: LVM: VG 64 0x130000: PVLink 1 0x000044 Failed! The PV is not accessible. May 22 14:51:49 zjddkf02 cmdisklockd[4459]: Still trying to inquire cluster lock disk /dev/disk/disk274 May 22 14:51:59 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000083 Failed! The PV is not accessible. May 22 14:51:59 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000085 Failed! The PV is not accessible. ... May 22 14:55:37 zjddkf02 above message repeats 4 times May 22 14:55:37 zjddkf02 vmunix: LVM: WARNING: VG 64 0x110000: LV 19: Some I/O requests to this LV are waiting May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x090000: PVLink 1 0x000054 Recovered. May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 1: All I/O requests to this LV that were May 22 14:55:37 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed. May 22 14:55:37 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV). May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 3: All I/O requests to this LV that were May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 5: All I/O requests to this LV that were May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x090000: LV 4: All I/O requests to this LV that were May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000045 Recovered. May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000046 Recovered. May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000047 Recovered. May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed. May 22 14:55:38 zjddkf02 above message repeats 3 times May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000049 Recovered. May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000050 Recovered. May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000053 Recovered. May 22 14:55:38 zjddkf02 vmunix: LVM: VG 64 0x140000: PVLink 1 0x000051 Failed! The PV is not accessible. May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 5: All I/O requests to this LV that were May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed. May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 16: All I/O requests to this LV that were May 22 14:55:38 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x140000: LV 2: All I/O requests to this LV that were May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x140000: LV 12: Some I/O requests to this LV are waiting May 22 14:55:38 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until May 22 14:55:38 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV). May 22 14:55:38 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed. May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x190000: LV 14: Some I/O requests to this LV are waiting May 22 14:55:38 zjddkf02 vmunix: LVM: WARNING: VG 64 0x130000: LV 16: Some I/O requests to this LV are waiting May 22 14:55:38 zjddkf02 vmunix: indefinitely for an unavailable PV. These requests will be queued until May 22 14:55:38 zjddkf02 above message repeats 2 times May 22 14:55:38 zjddkf02 vmunix: the PV becomes available (or a timeout is specified for the LV). May 22 14:55:38 zjddkf02 above message repeats 2 times May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000056 Recovered. May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000057 Failed! The PV is not accessible. May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x150000: PVLink 1 0x000058 Failed! The PV is not accessible. May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x150000: LV 29: All I/O requests to this LV that were May 22 14:55:39 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed. May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000027 Recovered. May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000028 Recovered. May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x120000: PVLink 1 0x000029 Recovered. May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x120000: LV 27: All I/O requests to this LV that were May 22 14:55:39 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed. May 22 14:55:39 zjddkf02 vmunix: LVM: NOTICE: VG 64 0x130000: LV 16: All I/O requests to this LV that were May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000083 Recovered. May 22 14:55:39 zjddkf02 vmunix: LVM: VG 64 0x180000: PVLink 1 0x000084 Recovered. May 22 14:55:39 zjddkf02 vmunix: waiting indefinitely for an unavailable PV have now completed.
摘录部分信息,从14:51:14开始出现问题,pv不可得,物理链路丢失,但是到14:55:39已经recovered完成。唉,悲催的,有人动了SAN的相关配置文件。
又是误操作,,,,
再来说说UNDO tablespace,其实我们的UNDO完全是够用的,看看现在的配置:
SQL> show parameter undo NAME TYPE VALUE ------------------------------------ -------------------------------- ------------------------------ _gc_undo_affinity boolean FALSE undo_management string AUTO undo_retention integer 36000 undo_tablespace string UNDOTBS1 SQL> SELECT d.undo_size/(1024*1024) "ACTUAL UNDO SIZE [MByte]", 2 SUBSTR(e.value,1,25) "UNDO RETENTION [Sec]", 3 (TO_NUMBER(e.value) * TO_NUMBER(f.value) * 4 g.undo_block_per_sec) / (1024*1024) 5 "NEEDED UNDO SIZE [MByte]" 6 FROM ( 7 SELECT SUM(a.bytes) undo_size 8 FROM v$datafile a, 9 v$tablespace b, 10 dba_tablespaces c 11 WHERE c.contents = 'UNDO' 12 AND c.status = 'ONLINE' 13 AND b.name = c.tablespace_name 14 AND a.ts# = b.ts# 15 ) d, 16 v$parameter e, 17 v$parameter f, 18 ( 19 SELECT MAX(undoblks/((end_time-begin_time)*3600*24)) 20 undo_block_per_sec 21 FROM v$undostat 22 ) g 23 WHERE e.name = 'undo_retention' 24 AND f.name = 'db_block_size' 25 / ACTUAL UNDO SIZE [MByte] UNDO RETENTION [Sec] NEEDED UNDO SIZE [MByte] ------------------------ ----------------------------------- ------------------------ 96000 36000 35242.5
要保证undo retention需要的表空间大小事35242.5M,我们的UNDO表空间大小是96000M,为什么早晨会开始报ORA-01555呢,
经过和应用沟通,最近他们把本来晚上跑的批处理,报表,放在白天开始跑。隐隐蛋疼的感觉,这么大的操作,也能随意调时间,
白天业务忙,事务量高,一个报表都要跑1-2小时,不报01555才怪呢。
汇报到这里吧,IT系统有风险,操作需谨慎,可惜误操作的哥们,撞到雷了,,,,一定要小心,小心,再小心