In day-to-day use of Oracle GoldenGate, processes abend fairly often, and the messages in ggserr.log are sometimes hard to make sense of.
2011-11-01 09:14:28 WARNING OGG-01431 Oracle GoldenGate Delivery for Oracle, rep1.prm: Aborted grouped transaction on 'XXX.XXX_BONUS_LOG', Mapping error.
2011-11-01 09:14:28 WARNING OGG-01003 Oracle GoldenGate Delivery for Oracle, rep1.prm: Repositioning to rba 4662627 in seqno 2114.
2011-11-01 09:14:28 WARNING OGG-01151 Oracle GoldenGate Delivery for Oracle, rep1.prm: Error mapping from XXX.XXX_BONUS_LOG to XXX.XXX_BONUS_LOG.
2011-11-01 09:14:28 WARNING OGG-01003 Oracle GoldenGate Delivery for Oracle, rep1.prm: Repositioning to rba 4662627 in seqno 2114.
2011-11-01 09:14:28 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, rep1.prm: Error mapping from XXX.XXX_BONUS_LOG to XXX.XXX_BONUS_LOG.
2011-11-01 09:14:28 ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, rep1.prm: PROCESS ABENDING.
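When ggserr.log is long, the position that Replicat reports and the table named in the mapping error can be pulled out with a short script. The following is only a sketch: the function name `summarize_abend` and the regular expressions are illustrative assumptions based on the message wording shown above, not part of GoldenGate.

```python
import re

# Matches messages such as "Repositioning to rba 4662627 in seqno 2114."
REPOSITION = re.compile(r"Repositioning to rba (\d+) in seqno (\d+)")
# Matches "Error mapping from SRC to TGT." and captures both table names.
MAP_ERROR = re.compile(r"Error mapping from (\S+) to (\S+?)\.?$")


def summarize_abend(lines):
    """Return the last reported reposition point and the mapping-error tables."""
    summary = {"seqno": None, "rba": None, "source": None, "target": None}
    for line in lines:
        line = line.strip()
        m = REPOSITION.search(line)
        if m:
            summary["rba"] = int(m.group(1))
            summary["seqno"] = int(m.group(2))
        m = MAP_ERROR.search(line)
        if m:
            summary["source"], summary["target"] = m.group(1), m.group(2)
    return summary


# Two of the ggserr.log lines quoted above, used as sample input.
sample = [
    "2011-11-01 09:14:28 WARNING OGG-01003 Oracle GoldenGate Delivery for Oracle, "
    "rep1.prm: Repositioning to rba 4662627 in seqno 2114.",
    "2011-11-01 09:14:28 ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, "
    "rep1.prm: Error mapping from XXX.XXX_BONUS_LOG to XXX.XXX_BONUS_LOG.",
]
print(summarize_abend(sample))
```

The seqno and RBA it returns are the ones Replicat repositions to, which is not necessarily the record that failed.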
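If you inspect the error above with logdump, you will find that the record at seqno 2114, RBA 4662627 does not belong to XXX.XXX_BONUS_LOG at all. The reason is simple: when applying changes, GoldenGate by default preserves the transactional consistency of the source. In this example, seqno 2114, RBA 4662627 is only the start of the transaction, and ggserr.log gives no way to locate the operation that actually failed. A few special parameters help pinpoint the real cause.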
- SHOWSYNTAX
Use the SHOWSYNTAX parameter to start an interactive session where you can view each Replicat SQL statement before it is applied. By viewing the syntax of SQL statements that failed, you might be able to diagnose the cause of the problem.
- NODYNSQL
Use NODYNSQL to make Replicat build literal SQL statements with the bind variables resolved. With DYNSQL, the default, Replicat uses dynamic SQL to compile a statement once, and then execute it many times with different bind variables.
- NOBINARYCHARS
NOBINARYCHARS is an undocumented parameter that causes Oracle GoldenGate to treat binary data as a null-terminated string.
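With these three parameters combined, the report file records the full SQL statements and the exact failing position. Using logdump together with those SQL statements, you should be able to find the cause quickly.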
2011-11-01 09:15:56 WARNING OGG-01431 Aborted grouped transaction on 'XXX.XXX_BONUS_LOG', Mapping error.
2011-11-01 09:15:56 WARNING OGG-01003 Repositioning to rba 4662954 in seqno 2114.
2011-11-01 09:15:56 WARNING OGG-01151 Error mapping from XXX.XXX_BONUS_LOG to XXX.XXX_BONUS_LOG.
2011-11-01 09:15:56 WARNING OGG-01003 Repositioning to rba 4662954 in seqno 2114.
Source Context :
SourceModule : [er.main]
SourceID : [/scratch/angorant/view_storage/angorant_ogg_12978807_x64/oggcore/OpenSys/src/app/er/rep.c]
SourceFunction : [take_rep_err_action]
SourceLine : [16134]
ThreadBacktrace : [8] elements
: [/u01/app/oracle/ggs/replicat(CMessageContext::AddThreadContext()+0x26) [0x5ef8b6]]
: [/u01/app/oracle/ggs/replicat(CMessageFactory::CreateMessage(CSourceContext*, unsigned int, ...)+0x7b2) [0x5e6382]]
: [/u01/app/oracle/ggs/replicat(_MSG_ERR_MAP_TO_TANDEM_FAILED(CSourceContext*, DBString<777> const&, DBString<777> const&, CMessageFactory::MessageDisposition)+0x9b) [0x5c4bcb]]
: [/u01/app/oracle/ggs/replicat [0x81ac2f]]
: [/u01/app/oracle/ggs/replicat [0x8f73e2]]
: [/u01/app/oracle/ggs/replicat(main+0x84b) [0x50764b]]
: [/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x30f3c1c40b]]
: [/u01/app/oracle/ggs/replicat(__gxx_personality_v0+0x1da) [0x4e3c2a]]
2011-11-01 09:15:56 ERROR OGG-01296 Error mapping from XXX.XXX_BONUS_LOG to XXX.XXX_BONUS_LOG.
As you can see, the failing position is RBA 4662954. The trail record at that position looks like this:
Logdump 195 >n
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: B (x42)
RecLength : 227 (x00e3) IO Time : 2011/10/31 11:18:19.230.994
IOType : 3 (x03) OrigNode : 255 (xff)
TransInd : . (x01) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 4602 AuditPos : 200518188
Continued : N (x00) RecCount : 1 (x01)
2011/10/31 11:18:19.230.994 Delete Len 227 RBA 4662954
Name: XXX.XXX_BONUS_LOG
Before Image: Partition 4 G m
0000 0015 0000 3230 3131 2d31 302d 3331 3a31 303a | ......2011-10-31:10:
3537 3a34 3500 0100 0a00 0000 0000 0000 1ca2 1100 | 57:45...............
0200 0a00 0000 0000 0000 0000 1b00 0300 0a00 00ff | ....................
ffff ffff fffc e000 0400 0a00 0000 0000 0000 0000 | ....................
0000 0500 0700 0000 034e 4554 0006 000a 0000 0000 | .........NET........
0102 aea6 3a3b 0007 0004 ffff 0000 0008 0014 0000 | ....:;..............
0010 3331 3131 3938 302d bbfd b7d6 bbbb b9ba 0009 | ..3111980-..........
Column 0 (x0000), Len 21 (x0015)
0000 3230 3131 2d31 302d 3331 3a31 303a 3537 3a34 | ..2011-10-31:10:57:45
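To cross-check a column value against the target table, the hex words that logdump prints can be converted back to text. The sketch below is an illustration only: the helper names `decode_logdump_hex` and `printable` are made up here, and treating the two leading zero bytes as a prefix rather than part of the value is an assumption. The hex line for column 0 in the excerpt carries 20 of the column's 21 bytes, so the last digit of the timestamp is missing from the decoded text.

```python
def decode_logdump_hex(hex_words):
    """Convert logdump-style hex words ('0000 3230 ...') to raw bytes."""
    return bytes.fromhex("".join(hex_words.split()))


def printable(raw):
    """Render bytes like the right-hand column of a hex dump: ASCII or '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Hex words copied from the "Column 0" line of the logdump output above.
column0 = decode_logdump_hex("0000 3230 3131 2d31 302d 3331 3a31 303a 3537 3a34")

print(printable(column0))            # → ..2011-10-31:10:57:4
# Assumption: the first two bytes are a prefix, not part of the value.
print(column0[2:].decode("ascii"))   # → 2011-10-31:10:57:4
```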
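Logdump shows that the failing operation is a delete in the middle of a transaction. A check confirmed that this row really does not exist in the target table, which is what caused the Error mapping failure. Because the record sits in the middle of a transaction, we cannot simply jump to the position after the failing statement. Two more parameters are needed here.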
- GROUPTRANSOPS
Controls the number of SQL operations that are grouped into a single Replicat transaction.
- MAXTRANSOPS
Divides large source transactions into smaller ones on the target system.
With these two parameters, large source transactions can be split into small ones. For convenience, we set both of them to 1.
edit params rep1
grouptransops 1
maxtransops 1
Then restart the rep1 process. Once rep1 stops at the failing position, skip the problem statement manually.
alter rep1, extseqno 2114, extrba 4663281
start rep1
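The RBA passed to extrba has to be the start of the record that follows the failing one. One way to find it is to step to the next record in logdump and read its RBA. The sketch below automates that lookup over saved logdump output. It is an illustration only: the function names are made up, the regular expression assumes the "Delete Len 227 RBA 4662954" summary-line format shown earlier, and the second record in the sample (its operation type, length, and timestamp) is hypothetical. Only its RBA, 4663281, comes from the alter command above.

```python
import re

# Logdump prints one summary line per record, for example:
# "2011/10/31 11:18:19.230.994 Delete Len 227 RBA 4662954"
RECORD = re.compile(r"(\w+)\s+Len\s+(\d+)\s+RBA\s+(\d+)")


def record_positions(logdump_text):
    """Return (rba, operation) pairs for every record summary line found."""
    return [(int(rba), op) for op, _length, rba in RECORD.findall(logdump_text)]


def next_rba_after(logdump_text, failing_rba):
    """Return the smallest record RBA greater than failing_rba, or None."""
    later = sorted(rba for rba, _op in record_positions(logdump_text) if rba > failing_rba)
    return later[0] if later else None


# First line is from the logdump output above; the second line is hypothetical.
sample_output = """
2011/10/31 11:18:19.230.994 Delete Len 227 RBA 4662954
2011/10/31 11:18:19.230.994 Insert Len 100 RBA 4663281
"""

print(next_rba_after(sample_output, 4662954))
```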
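With that, the problem is resolved. The best permanent fix is, of course, a full resynchronization of the tables whose data has diverged. In a fairly large production environment, however, resynchronizing whole tables is troublesome, so when only a few statements fail, skipping them is a reasonable workaround. This case happened to involve a delete, so the record could simply be skipped. An update or insert would need further analysis.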