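When I got to the office this morning, a colleague told me a table was locked. I tried it myself: every row in which a certain column equals 1111111 was locked, and even a SELECT on those rows failed with ORA-01591. TOAD's Knowledge Xpert had very little on it, only that the lock is caused by a failed distributed transaction. When I asked my colleague, it turned out that yesterday a call from one stored procedure to another had failed, and the called procedure inserts data into another database through the Transparent Gateway.
My first thought was OEM, but it was a big letdown: under Locks, none of the relevant objects showed as locked, which was puzzling. I moved on to the sessions: the user had five sessions, all INACTIVE, so without further thought I killed them all. Nothing changed, and still no lock appeared. I logged in to the host remotely and found that … and … were both normal, and the Transparent Gateway process was not hung either (we had previously seen TG4SQL use around 25% CPU even with no workload, and then hang).
Then it occurred to me to check alert.log. After a careful search, I finally found this: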
Wed Nov 17 00:00:04 2004
Errors in file d:\oracle\admin\xdcj\udump\xdcj_j006_3020.trc:
ORA-12012: error on auto execute of job 82
ORA-01591: lock held by in-doubt distributed transaction 6.5.887985
ORA-06512: at line 6
This is exactly where the error shows up. Tracing further back:
Tue Nov 16 17:35:04 2004
Error 28500 trapped in 2PC on transaction 6.5.887985. Cleaning up.
Error stack returned to user:
ORA-02054: transaction 6.5.887985 in-doubt
ORA-28500: connection from ORACLE to a non-Oracle system returned this message:
[Transparent gateway for MSSQL]
ORA-02063: preceding 2 lines from ZSMOS_CRM
Tue Nov 16 17:35:04 2004
DISTRIB TRAN QDCJ.US.ORACLE.COM.5ae32328.6.5.887985
is local tran 6.5.887985
insert pending prepared tran, scn=6606197672830
Tue Nov 16 17:35:07 2004
Errors in file d:\oracle\admin\xdcj\bdump\xdcj_reco_3024.trc:
ORA-28500: connection from ORACLE to a non-Oracle system returned this message:
[Transparent gateway for MSSQL][Microsoft][ODBC SQL Server Driver][SQL Server]Login failed for user 'RECOVER'.
ORA-02063: preceding 2 lines from ZSMOS_CRM
Tue Nov 16 17:35:12 2004
Errors in file d:\oracle\admin\xdcj\bdump\xdcj_reco_3024.trc:
ORA-28500: connection from ORACLE to a non-Oracle system returned this message:
[Transparent gateway for MSSQL][Microsoft][ODBC SQL Server Driver][SQL Server]Login failed for user 'RECOVER'.
ORA-02063: preceding 2 lines from ZSMOS_CRM
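The search through alert.log can be made repeatable by scanning for the in-doubt messages and mapping each local transaction ID to its global one. The Python sketch below assumes an English-language alert.log using the message texts shown in these excerpts. A localized log, like the original one, would need its own patterns. The function name `scan_alert_log` is hypothetical.

```python
import re
from typing import Dict, Optional

# English texts of the messages in the excerpts above (assumption: an
# English-language alert.log; localized logs need their own patterns).
IN_DOUBT_PATTERNS = [
    re.compile(r"ORA-01591: lock held by in-doubt distributed transaction (\d+\.\d+\.\d+)"),
    re.compile(r"ORA-02054: transaction (\d+\.\d+\.\d+) in-doubt"),
    re.compile(r"trapped in 2PC on transaction (\d+\.\d+\.\d+)"),
]
# "DISTRIB TRAN <global id>" followed (on the next line) by "is local tran <local id>"
DISTRIB_TRAN = re.compile(r"DISTRIB TRAN (\S+)\s+is local tran (\d+\.\d+\.\d+)")


def scan_alert_log(text: str) -> Dict[str, Optional[str]]:
    """Map each in-doubt local transaction ID found in `text` to its global
    transaction ID, or to None if the log never names the global ID."""
    found: Dict[str, Optional[str]] = {}
    for pattern in IN_DOUBT_PATTERNS:
        for match in pattern.finditer(text):
            found.setdefault(match.group(1), None)
    for match in DISTRIB_TRAN.finditer(text):
        global_id, local_id = match.groups()
        found[local_id] = global_id
    return found


log = """Error 28500 trapped in 2PC on transaction 6.5.887985. Cleaning up.
ORA-02054: transaction 6.5.887985 in-doubt
DISTRIB TRAN QDCJ.US.ORACLE.COM.5ae32328.6.5.887985
is local tran 6.5.887985
ORA-01591: lock held by in-doubt distributed transaction 6.5.887985
"""
print(scan_alert_log(log))
# {'6.5.887985': 'QDCJ.US.ORACLE.COM.5ae32328.6.5.887985'}
```

The resulting IDs are the values you would then look up in DBA_2PC_PENDING.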
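These alert.log entries from the afternoon of November 16 are where it all started. The remote transaction apparently failed yesterday afternoon without returning properly, which left the distributed transaction hanging in doubt and its rows locked. I finally had the more specific error, ORA-02054. TOAD said to either wait for the transaction or commit it, but how do you do that? So I searched the official documentation and found the following in the Administrator's Guide: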
Discovering Problems with a Two-Phase Commit
The user application that commits a distributed transaction is informed of a problem by one of the following error messages:
ORA-02050: transaction ID rolled back, some remote dbs may be in-doubt
ORA-02051: transaction ID committed, some remote dbs may be in-doubt
ORA-02054: transaction ID in-doubt
A robust application should save information about a transaction if it receives any of the above errors. This information can be used later if manual distributed transaction recovery is desired.
No action is required by the administrator of any node that has one or more in-doubt distributed transactions due to a network or system failure. The automatic recovery features of Oracle transparently complete any in-doubt transaction so that the same outcome occurs on all nodes of a session tree after the network or system failure is resolved.
In extended outages, however, you can force the commit or rollback of a transaction to release any locked data. Applications must account for such possibilities.
Determining Whether to Perform a Manual Override
Override a specific in-doubt transaction manually only when one of the following situations exists:
The in-doubt transaction locks data that is required by other transactions. This situation occurs when the ORA-01591 error message interferes with user transactions.
An in-doubt transaction prevents the extents of a rollback segment from being used by other transactions. The first portion of an in-doubt distributed transaction's local transaction ID corresponds to the ID of the rollback segment, as listed by the data dictionary views DBA_2PC_PENDING and DBA_ROLLBACK_SEGS.
The failure preventing the two-phase commit phases from completing cannot be corrected in an acceptable time period. Examples of such cases include a telecommunication network that has been damaged or a damaged database that requires a long recovery time.
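To match a local transaction ID against DBA_ROLLBACK_SEGS, split it into its parts. The Python sketch below assumes the three numbers are the undo segment number, the slot, and the sequence, matching the XIDUSN, XIDSLOT and XIDSQN columns of V$TRANSACTION. The quoted text only confirms the first part. `parse_local_tran_id` and `LocalTranId` are hypothetical names.

```python
from typing import NamedTuple


class LocalTranId(NamedTuple):
    usn: int   # rollback/undo segment number (DBA_ROLLBACK_SEGS.SEGMENT_ID)
    slot: int  # assumed: slot in that segment's transaction table (XIDSLOT)
    seq: int   # assumed: sequence number of that slot (XIDSQN)


def parse_local_tran_id(value: str) -> LocalTranId:
    """Split a LOCAL_TRAN_ID such as '6.5.887985' into its three numbers."""
    parts = value.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a local transaction ID: {value!r}")
    usn, slot, seq = (int(p) for p in parts)
    return LocalTranId(usn, slot, seq)


tid = parse_local_tran_id("6.5.887985")
print(tid.usn)  # 6: the rollback segment to look for in DBA_ROLLBACK_SEGS
```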
Normally, you should make a decision to locally force an in-doubt distributed transaction in consultation with administrators at other locations. A wrong decision can lead to database inconsistencies that can be difficult to trace and that you must manually correct.
If the conditions above do not apply, always allow the automatic recovery features of Oracle to complete the transaction. If any of the above criteria are met, however, consider a local override of the in-doubt transaction.
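The three criteria, together with the default of leaving the transaction to automatic recovery, can be written as a small decision function. This is only a sketch and the field names are hypothetical. The function deliberately does not decide between commit and rollback, because the text above says that choice should be agreed with the other sites.

```python
from dataclasses import dataclass


@dataclass
class InDoubtSituation:
    # One flag per condition in the quoted list (field names are hypothetical).
    blocks_other_transactions: bool  # users hit ORA-01591 on the locked data
    pins_rollback_extents: bool      # rollback segment extents cannot be reused
    outage_too_long: bool            # 2PC failure not fixable in acceptable time


def consider_manual_override(s: InDoubtSituation) -> bool:
    """True: a local COMMIT FORCE / ROLLBACK FORCE may be considered (after
    consulting the other sites). False: leave it to automatic recovery."""
    return s.blocks_other_transactions or s.pins_rollback_extents or s.outage_too_long


# This article's case: ORA-01591 blocked access to the rows.
print(consider_manual_override(InDoubtSituation(True, False, False)))  # True
```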
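The advice fits our case well enough. The repeated RECO login failures against SQL Server in the log were Oracle's automatic recovery trying to resolve the transaction, and it kept failing. Querying the DBA_2PC_PENDING view did show the transaction. So how do you actually resolve it?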
Manually Committing an In-Doubt Transaction
Before attempting to commit the transaction, ensure that you have the proper privileges. Note the following requirements:
If the transaction was committed by... | Then you must have this privilege...
You | FORCE TRANSACTION
Another user | FORCE ANY TRANSACTION
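Before you run COMMIT FORCE, the privilege table can be checked against the privileges your session actually holds. In the Python sketch below, the list passed in is assumed to be the PRIVILEGE column of the SESSION_PRIVS view. `can_force` is a hypothetical helper. FORCE ANY TRANSACTION is treated as covering your own transactions too, since it applies to any in-doubt transaction.

```python
from typing import Iterable


def can_force(session_privs: Iterable[str], committed_by_self: bool) -> bool:
    """Apply the privilege table above to a list of privilege names, e.g. the
    PRIVILEGE column of SELECT * FROM SESSION_PRIVS."""
    privs = {p.strip().upper() for p in session_privs}
    if "FORCE ANY TRANSACTION" in privs:
        return True  # covers in-doubt transactions of any user, including your own
    return committed_by_self and "FORCE TRANSACTION" in privs


print(can_force(["CREATE SESSION", "FORCE TRANSACTION"], committed_by_self=False))  # False
```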
Committing Using Only the Transaction ID
The following SQL statement commits an in-doubt transaction:
COMMIT FORCE 'transaction_id';
The variable transaction_id is the identifier of the transaction as specified in either the LOCAL_TRAN_ID or GLOBAL_TRAN_ID columns of the DBA_2PC_PENDING data dictionary view.
For example, assume that you query DBA_2PC_PENDING and determine that LOCAL_TRAN_ID for a distributed transaction is 1.45.13.
You then issue the following SQL statement to force the commit of this in-doubt transaction:
COMMIT FORCE '1.45.13';
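A forced commit cannot be undone, so it can be worth checking the ID's shape before building the statement. With this check, a typo such as '1:45.13' or a stray quote is rejected rather than sent to the database. The Python sketch below infers both ID formats from the examples in this article. That is an assumption, not Oracle's full grammar, and `commit_force_sql` is a hypothetical name.

```python
import re

# Formats inferred from the examples in this article (assumption, not a full grammar):
# LOCAL_TRAN_ID  '1.45.13'
# GLOBAL_TRAN_ID 'SALES.ACME.COM.55d1c563.1.93.29' (db name, hex id, local ID)
LOCAL_ID = re.compile(r"\d+\.\d+\.\d+")
GLOBAL_ID = re.compile(r"[A-Za-z0-9_$#.]+\.[0-9a-fA-F]+\.\d+\.\d+\.\d+")


def commit_force_sql(transaction_id: str) -> str:
    """Build COMMIT FORCE '<id>'; after checking the ID's shape, so a typo
    (or a stray quote) never reaches the database."""
    tid = transaction_id.strip()
    if not (LOCAL_ID.fullmatch(tid) or GLOBAL_ID.fullmatch(tid)):
        raise ValueError(f"unexpected transaction ID format: {transaction_id!r}")
    return f"COMMIT FORCE '{tid}';"


print(commit_force_sql("1.45.13"))  # COMMIT FORCE '1.45.13';
```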
Committing Using an SCN
Optionally, you can specify the SCN for the transaction when forcing a transaction to commit. This feature allows you to commit an in-doubt transaction with the SCN assigned when it was committed at other nodes.
Consequently, you maintain the synchronized commit time of the distributed transaction even if there is a failure. Specify an SCN only when you can determine the SCN of the same transaction already committed at another node.
For example, assume you want to manually commit a transaction with the following global transaction ID:
SALES.ACME.COM.55d1c563.1.93.29
First, query the DBA_2PC_PENDING view of a remote database also involved with the transaction in question. Note the SCN used for the commit of the transaction at that node. Specify the SCN when committing the transaction at the local node. For example, if the SCN is 829381993, issue:
COMMIT FORCE 'SALES.ACME.COM.55d1c563.1.93.29', 829381993;
See Also:
Oracle9i SQL Reference for more information about using the COMMIT statement
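The SCN step amounts to a lookup in the other node's DBA_2PC_PENDING rows. The Python sketch below takes those rows as dicts keyed by column name. It assumes that the COMMIT# column (stored as text) carries the commit SCN the documentation refers to, and that a committed transaction shows STATE 'committed' there. `commit_force_with_scn` is a hypothetical helper. It returns None rather than guessing when no committed row exists, because the text says to specify an SCN only when you can determine it.

```python
from typing import Iterable, Mapping, Optional


def commit_force_with_scn(global_tran_id: str,
                          remote_rows: Iterable[Mapping[str, str]]) -> Optional[str]:
    """Given DBA_2PC_PENDING rows read at another node (one dict per row,
    keyed by column name), build COMMIT FORCE '<id>', <scn>; from COMMIT#
    where that node shows the transaction as committed.
    Returns None when no committed row is found, i.e. the SCN is unknown."""
    for row in remote_rows:
        if (row.get("GLOBAL_TRAN_ID") == global_tran_id
                and row.get("STATE", "").lower() == "committed"):
            scn = int(row["COMMIT#"])  # assumed: COMMIT# holds the commit SCN as text
            return f"COMMIT FORCE '{global_tran_id}', {scn};"
    return None


remote = [{"GLOBAL_TRAN_ID": "SALES.ACME.COM.55d1c563.1.93.29",
           "STATE": "committed", "COMMIT#": "829381993"}]
print(commit_force_with_scn("SALES.ACME.COM.55d1c563.1.93.29", remote))
# COMMIT FORCE 'SALES.ACME.COM.55d1c563.1.93.29', 829381993;
```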
Manually Rolling Back an In-Doubt Transaction
Before attempting to roll back the in-doubt distributed transaction, ensure that you have the proper privileges. Note the following requirements:
If the transaction was committed by... | Then you must have this privilege...
You | FORCE TRANSACTION
Another user | FORCE ANY TRANSACTION
The following SQL statement rolls back an in-doubt transaction:
ROLLBACK FORCE 'transaction_id';
The variable transaction_id is the identifier of the transaction as specified in either the LOCAL_TRAN_ID or GLOBAL_TRAN_ID columns of the DBA_2PC_PENDING data dictionary view.
For example, to roll back the in-doubt transaction with the local transaction ID of 2.9.4, use the following statement:
ROLLBACK FORCE '2.9.4';
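Whether to COMMIT FORCE or ROLLBACK FORCE should follow the outcome already reached at the other sites. Otherwise the nodes disagree, which is the inconsistency the documentation warns about. The Python sketch below encodes that rule. The outcome strings are hypothetical labels for what you learn from the other site's administrator or from its DBA_2PC_PENDING view. Any outcome that is not final raises an error instead of picking a side.

```python
def force_statement(transaction_id: str, outcome_elsewhere: str) -> str:
    """Mirror the outcome already reached at another node of the session tree,
    so every node ends up the same. `outcome_elsewhere` is what you learned
    from the other site: 'committed'/'forced commit' or
    'rolled back'/'forced rollback'. Anything else means: do not force yet."""
    outcome = outcome_elsewhere.strip().lower()
    if outcome in ("committed", "forced commit"):
        return f"COMMIT FORCE '{transaction_id}';"
    if outcome in ("rolled back", "forced rollback"):
        return f"ROLLBACK FORCE '{transaction_id}';"
    raise ValueError(f"outcome {outcome_elsewhere!r} is not final; consult the other site first")


print(force_statement("2.9.4", "rolled back"))  # ROLLBACK FORCE '2.9.4';
```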
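So I logged in to the database and ran:
COMMIT FORCE '6.5.887985';
Querying DBA_2PC_PENDING afterwards showed the state had changed to 'forced commit', and SELECTs on the affected rows worked normally again. That resolved the problem.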
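Overall, running INSERT ... TABLENAME@SQLDBLK directly over the gateway is quite risky: if the call does not return normally, you end up with exactly this problem. Oracle's documentation recommends using a package or stored procedure instead. I have since advised my colleague to switch to that approach, and it has passed testing so far.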