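This is a two-node RAC environment. An IPL was performed the night before last, and because the configuration was not checked carefully afterwards, the Data Guard parameters for shipping redo to the disaster-recovery standby ended up slightly different on the two nodes: node 1 shipped redo in async mode, while node 2 shipped it in sync mode.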
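That is a problem, because in sync mode the primary has to wait for the standby to acknowledge the write, whereas in async mode it does not. Consider an example: the data node 2 has just committed is at SCN 100, while the data node 1 committed is at SCN 90. When node 2's redo reaches the standby and is about to be applied, the standby finds that the redo for SCN 90 has not been applied yet (because node 1 ships its redo asynchronously). Node 2, however, ships in sync mode, so its commits have to wait for the write acknowledgement. If the standby had to wait until node 1's redo arrived on its own (for example at the next log switch) before it could apply node 2's redo, node 2's commits would effectively hang for that whole time. That wait could be long, which is not acceptable.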
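A mismatch like this can be spotted by comparing the redo transport parameters across instances. The query below is a minimal sketch, assuming a session that can read GV$PARAMETER on the RAC database; the '<unset>' placeholder is only an illustration, used so that a destination set on one instance and empty on another also counts as a difference.

```sql
-- List every log_archive_dest_n parameter whose value is not identical
-- on all running instances. GV$PARAMETER returns one row per instance.
-- '<unset>' is an arbitrary placeholder so NULL values are compared too.
SELECT name,
       COUNT(DISTINCT NVL(value, '<unset>')) AS distinct_values
FROM   gv$parameter
WHERE  name LIKE 'log_archive_dest_%'
AND    name NOT LIKE 'log_archive_dest_state%'
GROUP  BY name
HAVING COUNT(DISTINCT NVL(value, '<unset>')) > 1
ORDER  BY name;
```

Any row returned deserves a closer look. Some destinations, such as a local archive path, may differ between instances on purpose, so the result is a list of candidates rather than a list of errors.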
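So what does Oracle do? Simple: it makes node 1 switch logs. Node 1's redo can then be applied, and node 2's redo can be applied right after it. Commits no longer hang, but this brings another problem: log switches become extremely frequent, with one or two switches almost every minute. A sample taken at random from the log: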
Fri Mar 23 12:10:53 2012
Thread 1 advanced to log sequence 8069 (LGWR switch)
  Current log# 2 seq# 8069 mem# 0: /dev/rint_redo1_2a
  Current log# 2 seq# 8069 mem# 1: /dev/rint_redo1_2b
Fri Mar 23 12:10:53 2012
LNS: Standby redo logfile selected for thread 1 sequence 8069 for destination LOG_ARCHIVE_DEST_3
Fri Mar 23 12:11:47 2012
Thread 1 cannot allocate new log, sequence 8070
Checkpoint not complete
  Current log# 2 seq# 8069 mem# 0: /dev/rint_redo1_2a
  Current log# 2 seq# 8069 mem# 1: /dev/rint_redo1_2b
LGWR: Standby redo logfile selected to archive thread 1 sequence 8070
LGWR: Standby redo logfile selected for thread 1 sequence 8070 for destination LOG_ARCHIVE_DEST_2
Fri Mar 23 12:11:53 2012
Thread 1 advanced to log sequence 8070 (LGWR switch)
  Current log# 3 seq# 8070 mem# 0: /dev/rint_redo1_3a
  Current log# 3 seq# 8070 mem# 1: /dev/rint_redo1_3b
Fri Mar 23 12:11:54 2012
LNS: Standby redo logfile selected for thread 1 sequence 8070 for destination LOG_ARCHIVE_DEST_3
Fri Mar 23 12:12:47 2012
Thread 1 cannot allocate new log, sequence 8071
Checkpoint not complete
  Current log# 3 seq# 8070 mem# 0: /dev/rint_redo1_3a
  Current log# 3 seq# 8070 mem# 1: /dev/rint_redo1_3b
LGWR: Standby redo logfile selected to archive thread 1 sequence 8071
LGWR: Standby redo logfile selected for thread 1 sequence 8071 for destination LOG_ARCHIVE_DEST_2
Fri Mar 23 12:12:54 2012
Thread 1 advanced to log sequence 8071 (LGWR switch)
  Current log# 4 seq# 8071 mem# 0: /dev/rint_redo1_4a
  Current log# 4 seq# 8071 mem# 1: /dev/rint_redo1_4b
Fri Mar 23 12:12:54 2012
LNS: Standby redo logfile selected for thread 1 sequence 8071 for destination LOG_ARCHIVE_DEST_3
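The switch rate visible in the alert log can also be read from the database itself. The following is a rough sketch, assuming access to V$LOG_HISTORY; the one-day window and the hourly bucket are arbitrary choices made for illustration.

```sql
-- Count log switches per redo thread and per hour over the last 24 hours.
-- Each row in V$LOG_HISTORY corresponds to one log sequence of a thread.
SELECT thread#,
       TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour_bucket,
       COUNT(*)                               AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY thread#, TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY hour_bucket, thread#;
```

At roughly one switch per minute, as in the excerpt above, thread 1 would show about 60 switches in each hourly bucket.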
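Switching logs this often is clearly unacceptable. Leaving the performance impact aside for the moment, messages like these:
Thread 1 cannot allocate new log, sequence 8072
Checkpoint not complete
may also trigger bug-like problems such as ORA-00600 or ORA-07445. Sure enough, once the batch transactions started running in the afternoon, the database began reporting errors: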
Fri Mar 23 14:52:42 2012
Errors in file /soft/oracle/admin/int/udump/int1_ora_1118458.trc:
ORA-00600: internal error code, arguments: [kope2_readstr232], [1], [], [], [], [], [], []
Fri Mar 23 14:52:44 2012
Trace dumping is performing id=[cdmp_20120323145244]
Fri Mar 23 14:52:45 2012
Errors in file /soft/oracle/admin/int/udump/int1_ora_1118458.trc:
ORA-07445: exception encountered: core dump [_ptrgl] [SIGSEGV] [Invalid permissions for mapped object] [0x004280283] [] []
Fri Mar 23 14:57:00 2012
Errors in file /soft/oracle/admin/int/udump/int1_ora_3678348.trc:
ORA-00600: internal error code, arguments: [kope2_readstr232], [1], [], [], [], [], [], []
Fri Mar 23 14:57:02 2012
Trace dumping is performing id=[cdmp_20120323145702]
Fri Mar 23 15:03:04 2012
Errors in file /soft/oracle/admin/int/udump/int1_ora_3678348.trc:
ORA-07445: exception encountered: core dump [_ptrgl] [SIGSEGV] [Invalid permissions for mapped object] [0x004280283] [] []
Fri Mar 23 15:03:05 2012
Trace dumping is performing id=[cdmp_20120323150305]
Excerpt from trace file 1118458.trc:
/soft/oracle/admin/int/udump/int1_ora_1118458.trc Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options ORACLE_HOME = /soft/oracle/product/10.2.0/db_1 System name: AIX Node name: racdb1 Release: 3 Version: 5 Machine: 00C0D0A44C00 Instance name: int1 Redo thread mounted by this instance: 1 Oracle process number: 174 Unix process pid: 1118458, image: oracle@racdb1 *** ACTION NAME:() 2012-03-23 04:17:11.000 *** MODULE NAME:(java@esb2 (TNS V1-V3)) 2012-03-23 04:17:11.000 *** SERVICE NAME:(int) 2012-03-23 04:17:11.000 *** SESSION ID:(2087.10) 2012-03-23 04:17:11.000 kxfpg1srv could not start PZ99, inst 2 kxfpg1sg GV$ query failed to get slave on inst 2 *** 2012-03-23 14:52:42.192 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kope2_readstr232], [1], [], [], [], [], [], [] Current SQL statement for this session: begin dbms_aqin.aq$_enqueue_obj(queue_name => :1, sender_name => :2, sender_addr => :3, sender_protocol => :4, original_msgid => :5, correlation => :6, visibility => :7, priority=> :8, delay => :9, expiration=> :10, relative_msgid => :11, sequence_deviation => :12, exception_queue => :13, payload_type => :14, raw_user_data => null, object_user_data => :15, msgid => :16, recipients => :17, signature => :18, transformation => :19, delivery_mode =>:20); end; ----- PL/SQL Call Stack ----- object line object handle number name 70000080e2621d8 1 anonymous block ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedst+001c bl ksedst1 000000000 ? FFFFFFFFFFEADC4 ? ksedmp+0290 bl ksedst 104A43B68 ? ksfdmp+0018 bl 03F35A00 kgerinv+00dc bl _ptrgl kgesinv+0020 bl kgerinv 10044E8A8 ? 000000000 ? 000000001 ? 000000000 ? 000000001 ? kgesin+003c bl kgesinv FFFFFFFFFFEF7B0 ? 
11044E964 ? FFFFFFFFFFEB2F8 ? FFFFFFFFFFEB2F0 ? FFFFFFFFFFEBA60 ? kopp2upic+040c bl kgesin 110195408 ? 110430040 ? 104BCDB28 ? 100000001 ? 000000000 ? 000000001 ? B38F0000000001 ? 104A3D358 ? kope2upic2+1738 bl kopp2upic 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? FFFFFFFFFFEB320 ? 000000000 ? kodpunp+1914 bl kope2upic2 101C2AB68 ? 210195408 ? 1101B1A80 ? FFFFFFFFFFEB8F0 ? FFFFFFFFFFEB970 ? FFFFFFFFFFEB9C8 ? FFFFFFFFFFEB8A0 ? FFFFFFFFFFEB8B0 ? kopu2upkl+06ac bl kodpunp 110195408 ? 000000000 ? 000000000 ? FFFFFFFFFFF1C58 ? FFFFFFFFFFEF878 ? 70000078731F6B6 ? 70000078731F6B6 ? 70000078731F773 ? kopp2uattr+01f4 bl _ptrgl kopp2ucoll+17fc bl kopp2uattr 110195408 ? 000000000 ? 110561300 ? FFFFFFFFFFEFE10 ? FFFFFFFFFFEFE10 ? 4422208200000000 ? 10044CBB0 ? 1109F2F90 ? kopp2upic+015c bl kopp2ucoll FFFFFFFFFFF0B00 ? 110000388 ? FFFFFFFFFFEFC20 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kope2upic2+1738 bl kopp2upic 101098264 ? 000000241 ? 09E370001 ? 700000805840B70 ? 000000009 ? 000000000 ? 000000000 ? 000000000 ? kope2upic+0110 bl kope2upic2 FFFFFFFFFFF0750 ? 8448220010008000 ? 10124031C ? 000000000 ? FFFFFFFFFFF0750 ? 000000001 ? FFFFFFFFFFF0730 ? 11022A3D0 ? pmucp2upkl+04d0 bl kope2upic 000000002 ? 00000000A ? 000000001 ? 000000001 ? 000010000 ? 00000005C ? 000000001 ? 000000020 ? kopu2upkl+0444 bl pmucp2upkl FFFFFFFFFFF1B38 ? FFFFFFFFFFF2158 ? 8500000000 ? 110569418 ? FB00000000000030 ? 70000078731F68E ? 70000078731F68E ? 000000000 ? kopp2uattr+01f4 bl _ptrgl kopp2upic+08e0 bl kopp2uattr 000000000 ? 000000000 ? 000000000 ? 000000000 ? FFFFFFFFFFF1950 ? 24224240F1463AF0 ? FFFFFFFFFFF6630 ? 110437390 ? kope2upic2+1738 bl kopp2upic 000000000 ? 110540140 ? 200000002 ? 110436158 ? 7000007EF84AC18 ? 11043FA38 ? FFFFFFFFFFF17A0 ? FFFFFFFFFFF17A8 ? kodpunp+1914 bl kope2upic2 110195408 ? 000000000 ? 000000000 ? 000000000 ? 000000001 ? 000000006 ? 000000000 ? 1109F2F90 ? kokoupkl+0fb0 bl kodpunp 110195408 ? 000000000 ? 000000000 ? 
70000078731F920 ? FFFFFFFFFFF5E70 ? 70000078731F588 ? 70000078731F588 ? 70000078731F800 ? kpcocaup+0340 bl _ptrgl psdgbaa+0858 bl 03F36944 psdgba+0080 bl psdgbaa 7000000100311C8 ? 1101E7150 ? FFFFFFFFFFF65F8 ? 000005280 ? 000000000 ? 110580468 ? 000000000 ? 000000019 ? pfrfd2_init_binds+0 bl _ptrgl 5fc pfrfd_init_frame+00 bl pfrfd2_init_binds 18B000110195408 ? 010008110 ? ec 104C763CC ? 000000000 ? 000000000 ? 110284190 ? 004220283 ? pfrinstr_INFR+004c bl pfrfd_init_frame 101F83D28 ? 000000001 ? 700000787231A5B ? pfrrun_no_tool+005c bl _ptrgl pfrrun+1014 bl pfrrun_no_tool FFFFFFFFFFF6AA0 ? 110456BA0 ? 110415FE0 ? plsql_run+06b4 bl pfrrun 11057F8F0 ? peicnt+0224 bl plsql_run 11057F8F0 ? 1000000000000 ? 000000000 ? kkxexe+0250 bl peicnt FFFFFFFFFFF7DB8 ? 11057F8F0 ? opiexe+2ef8 bl kkxexe 110580468 ? kpoal8+0edc bl opiexe FFFFFFFFFFFB3D4 ? FFFFFFFFFFFAFC8 ? FFFFFFFFFFF95A8 ? opiodr+0ae0 bl _ptrgl ttcpip+1020 bl _ptrgl opitsk+1124 bl 01FA4F60 opiino+0990 bl opitsk 0FFFFD410 ? 000000000 ? opiodr+0ae0 bl _ptrgl opidrv+0484 bl 01FA3DB0 sou2o+0090 bl opidrv 3C02DAE25C ? 44065F000 ? FFFFFFFFFFFF310 ? opimai_real+01bc bl 01FA16D4 main+0098 bl opimai_real 000000000 ? 000000000 ? __start+0098 bl main 000000000 ? 000000000 ? |
/soft/oracle/admin/int/udump/int1_ora_3678348.trc Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options ORACLE_HOME = /soft/oracle/product/10.2.0/db_1 System name: AIX Node name: racdb1 Release: 3 Version: 5 Machine: 00C0D0A44C00 Instance name: int1 Redo thread mounted by this instance: 1 Oracle process number: 1098 Unix process pid: 3678348, image: oracle@racdb1 *** ACTION NAME:() 2012-03-23 14:57:00.607 *** MODULE NAME:(java@esb2 (TNS V1-V3)) 2012-03-23 14:57:00.607 *** SERVICE NAME:(int) 2012-03-23 14:57:00.607 *** SESSION ID:(1085.57096) 2012-03-23 14:57:00.607 *** 2012-03-23 14:57:00.607 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kope2_readstr232], [1], [], [], [], [], [], [] Current SQL statement for this session: begin dbms_aqin.aq$_enqueue_obj(queue_name => :1, sender_name => :2, sender_addr => :3, sender_protocol => :4, original_msgid => :5, correlation => :6, visibility => :7, priority=> :8, delay => :9, expiration=> :10, relative_msgid => :11, sequence_deviation => :12, exception_queue => :13, payload_type => :14, raw_user_data => null, object_user_data => :15, msgid => :16, recipients => :17, signature => :18, transformation => :19, delivery_mode =>:20); end; ----- PL/SQL Call Stack ----- object line object handle number name 70000080e2621d8 1 anonymous block ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedst+001c bl ksedst1 000000000 ? FFFFFFFFFFEADC4 ? ksedmp+0290 bl ksedst 104A43B68 ? ksfdmp+0018 bl 03F35A00 kgerinv+00dc bl _ptrgl kgesinv+0020 bl kgerinv 10044E8A8 ? 000000002 ? 10123E8E8 ? 0FFFFFFFF ? 000000160 ? kgesin+003c bl kgesinv FFFFFFFFFFEF7B0 ? 11044E964 ? FFFFFFFFFFEB2F8 ? FFFFFFFFFFEB2F0 ? FFFFFFFFFFEBA60 ? 
kopp2upic+040c bl kgesin 110195408 ? 110430040 ? 104BCDB28 ? 100000001 ? 000000000 ? 000000001 ? B38F0000000001 ? 104A3D358 ? kope2upic2+1738 bl kopp2upic FFFFFFFFFFEB2E0 ? 482C22401040B342 ? 1010E16A8 ? 11040B2D8 ? FFFFFFFFFFEB3E0 ? 11022A3D0 ? FFFFFFFFFFEB2E0 ? 000000001 ? kodpunp+1914 bl kope2upic2 101C2AB68 ? 482C4226E8FF00C0 ? 10365A044 ? 700000081F0BF40 ? FFFFFFFFFFEB970 ? 000000000 ? 0FFFED260 ? 110000388 ? kopu2upkl+06ac bl kodpunp 110195408 ? 000000000 ? 000000000 ? FFFFFFFFFFF1C58 ? FFFFFFFFFFEF878 ? 70000078731F6B6 ? 70000078731F6B6 ? 70000078731F773 ? kopp2uattr+01f4 bl _ptrgl kopp2ucoll+17fc bl kopp2uattr 110195408 ? 000000000 ? 110521228 ? FFFFFFFFFFEFE10 ? FFFFFFFFFFEFE10 ? 44222082FFFEFFD0 ? 10044CBB0 ? 000000000 ? kopp2upic+015c bl kopp2ucoll 1104A95C0 ? 1104A9610 ? 700000010028968 ? 700000010028968 ? 7000007EF84AC18 ? 2C00000000000000 ? 000000000 ? 000000000 ? kope2upic2+1738 bl kopp2upic 000000000 ? 000000000 ? 004C0181C ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? kope2upic+0110 bl kope2upic2 FFFFFFFFFFF0A90 ? 2620404000000000 ? 1002EFD98 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? 000000000 ? pmucp2upkl+04d0 bl kope2upic 000000000 ? 000000000 ? 7000000BEB7A000 ? 000000000 ? 000002000 ? 11044AF0C ? 700000010018078 ? 7000000BEF682A0 ? kopu2upkl+0444 bl pmucp2upkl FFFFFFFFFFF1B38 ? FFFFFFFFFFF2158 ? 8500000000 ? 110522778 ? FB00000000000030 ? 70000078731F68E ? 70000078731F68E ? 000000000 ? kopp2uattr+01f4 bl _ptrgl kopp2upic+08e0 bl kopp2uattr 10030AABC ? FFFFFFFFFFF1990 ? FFFFFFFFFFF1860 ? 11022A3D0 ? 110195408 ? 1104567C0 ? FFFFFFFFFFF6630 ? 110437390 ? kope2upic2+1738 bl kopp2upic 000000001 ? 110443AB0 ? 200000002 ? 110436158 ? 7000007EF84AC18 ? 11043FA38 ? FFFFFFFFFFF17A0 ? FFFFFFFFFFF17A8 ? kodpunp+1914 bl kope2upic2 110195408 ? 000000080 ? 000000001 ? 000000000 ? 10502C888 ? 110000388 ? 700000010027FE8 ? 7000007F296AC38 ? kokoupkl+0fb0 bl kodpunp 110195408 ? 000000000 ? 000000000 ? 70000078731F920 ? FFFFFFFFFFF5E70 ? 
70000078731F588 ? 70000078731F588 ? 70000078731F800 ? kpcocaup+0340 bl _ptrgl psdgbaa+0858 bl 03F36944 psdgba+0080 bl psdgbaa 7000000100311C8 ? 1101E7150 ? FFFFFFFFFFF65F8 ? 000005280 ? 000000000 ? 11047F040 ? 000000000 ? 000000019 ? pfrfd2_init_binds+0 bl _ptrgl 5fc pfrfd_init_frame+00 bl pfrfd2_init_binds 77000110195408 ? 010008110 ? ec 104C763CC ? 000000000 ? 000000000 ? 110284190 ? 004220283 ? pfrinstr_INFR+004c bl pfrfd_init_frame 101F83D28 ? 000000001 ? 700000787231A5B ? pfrrun_no_tool+005c bl _ptrgl pfrrun+1014 bl pfrrun_no_tool FFFFFFFFFFF6AA0 ? 110456BA0 ? 110415FE0 ? plsql_run+06b4 bl pfrrun 110473188 ? peicnt+0224 bl plsql_run 110473188 ? 1000000000000 ? 000000000 ? kkxexe+0250 bl peicnt FFFFFFFFFFF7DB8 ? 110473188 ? opiexe+2ef8 bl kkxexe 11047F040 ? kpoal8+0edc bl opiexe FFFFFFFFFFFB3D4 ? FFFFFFFFFFFAFC8 ? FFFFFFFFFFF95A8 ? opiodr+0ae0 bl _ptrgl ttcpip+1020 bl _ptrgl opitsk+1124 bl 01FA4F60 opiino+0990 bl opitsk 0FFFFD410 ? 000000000 ? opiodr+0ae0 bl _ptrgl opidrv+0484 bl 01FA3DB0 sou2o+0090 bl opidrv 3C02DAE25C ? 44065F000 ? FFFFFFFFFFFF310 ? opimai_real+01bc bl 01FA16D4 main+0098 bl opimai_real 000000000 ? 000000000 ? __start+0098 bl main 000000000 ? 000000000 ? |
After node 2's redo transport mode was changed to async, the frequent log switches stopped and these errors no longer appeared:
ALTER SYSTEM SET log_archive_dest_3='service=intdg2 lgwr async affirm valid_for=(online_logfiles,primary_role) db_unique_name=intdg2' SCOPE=BOTH;
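At the same time, the init file on both nodes was changed to contain nothing but SPFILE='/dev/rint_spfile', so that both instances read their parameters from the same shared spfile.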
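After a change like this it is worth confirming that both instances really ended up with the same settings. The two queries below are a sketch, assuming access to GV$ARCHIVE_DEST and GV$PARAMETER; they only read state and change nothing.

```sql
-- Effective transport mode of every standby destination, per instance.
-- After the fix, TRANSMIT_MODE for the same DEST_ID should match on
-- both instances.
SELECT inst_id, dest_id, status, transmit_mode, affirm, destination
FROM   gv$archive_dest
WHERE  target = 'STANDBY'
ORDER  BY dest_id, inst_id;

-- Which spfile each instance was started with. Both rows should show
-- the same shared device.
SELECT inst_id, value AS spfile
FROM   gv$parameter
WHERE  name = 'spfile'
ORDER  BY inst_id;
```

If the second query shows an empty value for an instance, that instance was started from a plain pfile, and changes made with SCOPE=BOTH cannot be persisted for it.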
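Summary: this incident should have been avoidable, but it happened because of carelessness at work. Major failures are very often caused by small details. Details decide success or failure!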