An extremely painful troubleshooting session: ORA-04061: existing state of has been invalidated

Error message (package names replaced with xxxx):

ORA-04061: existing state of has been invalidated 
ORA-04061: existing state of package "xxxx" has been invalidated 
ORA-04065: not executed, altered or dropped package "xxxx" 
ORA-06508: PL/SQL: could not find prog 

Situation 1: A developer had earlier asked me to look at a stored procedure problem. The error he showed me was ORA-01002: fetch out of sequence.

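Seeing this error, I immediately concluded that either a cursor in the stored procedure was written incorrectly or the related data was bad, so the cursor fetch failed. I had too much on my plate those days and neither the energy nor the inclination to dig into this kind of problem, so I asked the developer to check the program for errors himself.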

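Situation 2: About two days later my manager asked me about it. The developers still had not solved it and he wanted me to take over; the system was going live in three or four days, so it was urgent. With no way out, I started working on it.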

After talking with the developer, I got the detailed error log:


Errstack is: ORA-04061: existing state of  has been invalidated
ORA-04061: existing state of package "CBMAIN.PKG_JN_INAC" has been invalidated
ORA-04065: not executed, altered or dropped package "CBMAIN.PKG_JN_INAC"
ORA-06508: PL/SQL: could not find program unit being called: "CBMAIN.PKG_JN_INAC"
Callstack is: ----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
70000006e55e6b0       805  package body CBMAIN.PKG_ERR
70000006e670858       650  package body CBMAIN.PKG_INAC
70000006e6c7520      1478  package body CBMAIN.PKG_DEBT_PROD_CMD
70000006e68b530       468  package body CBMAIN.PKG_LNAP_CMD
70000006e6ddd40      1371  package body CBMAIN.PKG_JN_LNAP
7000000614cb0a0         1  anonymous block
70000006e76c400       380  procedure CBMAIN.INTERFACE_MAIN
700000061f6b9f0         2  anonymous block

Errcode is: -44 Errormsg is: Error executing business procedure [pkg_jn_lnap.lnap_comm_tran], error description [ORA-01002: fetch out of sequence
]
Errstack is: ORA-01002: fetch out of sequence

Callstack is: ----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
70000006e55e6b0       805  package body CBMAIN.PKG_ERR
70000006e76c400       421  procedure CBMAIN.INTERFACE_MAIN
700000061f6b9f0         2  anonymous block

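Situation 3: The bill system sends messages to the core banking system. The core system's interface receives each message and processes its contents through packages; whenever a message involves a general ledger transaction, the transaction fails, and the bill system reports the error. Through repeated discussions with the developers I gathered more information: these transactions do not fail every time; the failures only ever appear after the core system's batch run; and when the error occurs, recompiling the package / package body and related objects does not help, while after a database restart this type of transaction succeeds.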

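Because the problem could only be reproduced after the core batch run, on the first day all I could do was pore frantically over dense stored procedures and cursors. My main angle of attack was ORA-01002: fetch out of sequence, and I read every document on ORA-01002 I could find without getting anywhere. I had the developers add a large amount of logging output to the stored procedures, hoping that after the next day's batch run we would see exactly where the error occurred.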

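Situation 4: The next day I eagerly waited to see where the logging would show the failure, and nothing was logged at all. What the xxxx!! At this point I started to feel anxious and lost. I had no choice but to sit down with the developer again and go through it bit by bit, while searching Metalink and Baidu over and over. Nothing came of it.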

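Then the developer gave me another piece of information: after the core batch run, these bill system transactions fail, but if you then connect to the database with PL/SQL Developer and run the same stored procedure by hand, the transaction succeeds.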

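Combined with the earlier observation that a database restart also made it succeed, I gradually began to suspect some inconsistency in session information, variables, or memory. I pulled myself out of the ORA-01002: fetch out of sequence line of thinking and suspected that ORA-01002 was not what made the transaction fail; rather, some other problem was causing ORA-01002 to appear.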

Situation 5: Going back over the errors from the past few days, I found the following every single time:

Errcode is: -114 Errormsg is: GL: error executing Accounting, error code [-6508], error message\:[ORA-06508\: PL/SQL\: could not find program unit being called]
Errstack is: ORA-20001: 
ORA-06512: at "GL.PKG_ERR", line 272
ORA-06512: at "GL.PKG_GL_GLVC", line 521
ORA-04061: existing state of package body "GL.PKG_GL_EDCT" has been invalidated
ORA-04065: not executed, altered or dropped package body "GL.PKG_GL_EDCT"
ORA-06508: PL/SQL: could not find program unit being called: "GL.PKG_GL_EDCT"
Callstack is: ----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
70000006e55e6b0       805  package body CBMAIN.PKG_ERR
70000006e112b28       158  package body GL.PKG_GL_EXTN
70000006e63a538      1225  package body CBMAIN.PKG_JN_INAC
70000006e343768         1  anonymous block
70000006e5b13a8      1381  package body CBMAIN.PKG_INAC_CMD
70000006e6c7520      1545  package body CBMAIN.PKG_DEBT_PROD_CMD
70000006e68b530       468  package body CBMAIN.PKG_LNAP_CMD
70000006e6ddd40      1371  package body CBMAIN.PKG_JN_LNAP
7000000614cb0a0         1  anonymous block
70000006e76c400       380  procedure CBMAIN.INTERFACE_MAIN
70000006e75dde0         2  anonymous block

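So I suspected that the PKG_GL_EDCT package had been invalidated and had caused the chain of errors that followed. My thinking was moving forward, but I was still puzzled, and because the problem could not be reproduced on demand, there was no way to debug it whenever I wanted. By then it was nearly the end of the second day, and the problem was still neither located nor solved.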

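Just before leaving work, I had a flash of insight: every time, it was the database that got restarted. Tonight I would ask the developers to restart the application instead, that is, Tuxedo.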

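Once I had asked the developers to restart the application, I was full of anticipation, impatient even. I kept running the scenario through in my head and grew more and more convinced that restarting the application should also make the problem go away. Of course, the final fix could not be restarting the application every time, but if everything worked after a restart, it would confirm my hypothesis: the application keeps long-lived connections open, and those connections already hold state from the PKG_GL_EDCT package. After the batch run, DDL is executed against that package body and it is recompiled (checking production showed that DDL was performed on it after every nightly batch), so the package information held by those connections is stale. Calling the package through them then raises the error, which leads to the failed transactions and everything else.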
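To make the hypothesis concrete, here is a toy Python model. It is not Oracle code and not how Oracle tracks dependencies internally; the class and method names are made up for illustration. Each package body has a compilation counter that the nightly batch's DDL bumps, and each session remembers which compilation its package state was built from. The model never discards stale state, mirroring what was observed here: only replacing the connection helped.

```python
from dataclasses import dataclass, field


class PackageStateInvalidated(Exception):
    """Toy stand-in for the ORA-04061 / ORA-04065 / ORA-06508 error stack."""


@dataclass
class Database:
    # Compilation counter per package body; every DDL/recompile bumps it.
    versions: dict = field(default_factory=dict)

    def recompile(self, package):
        self.versions[package] = self.versions.get(package, 0) + 1


@dataclass
class Session:
    db: Database
    # Package state this session has instantiated: package -> version it was built from.
    state: dict = field(default_factory=dict)

    def call(self, package):
        current = self.db.versions[package]
        # The first call in this session instantiates the package state.
        built_from = self.state.setdefault(package, current)
        if built_from != current:
            # State was built from an older compilation; this model never clears it,
            # so every later call on this session fails as well.
            raise PackageStateInvalidated(
                f'existing state of package body "{package}" has been invalidated'
            )
        return "ok"


if __name__ == "__main__":
    db = Database()
    db.recompile("GL.PKG_GL_EDCT")
    tuxedo_conn = Session(db)                    # long-lived application connection
    print(tuxedo_conn.call("GL.PKG_GL_EDCT"))    # ok
    db.recompile("GL.PKG_GL_EDCT")               # nightly batch runs DDL on the body
    print(Session(db).call("GL.PKG_GL_EDCT"))    # ok: a fresh session, e.g. PL/SQL Developer
    try:
        tuxedo_conn.call("GL.PKG_GL_EDCT")
    except PackageStateInvalidated as exc:
        print("stale connection:", exc)
```

In the model, a brand-new session succeeds while the old one keeps failing, which matches the two observations above: manual runs from PL/SQL Developer worked, and only restarting the process that owned the connections cleared the error.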

---------------------------------------------------------------------------------------------------------------------------------------------------------------------

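Situation 6: Just when it seemed there was no way forward, a path opened up. That second night I kept turning the problem over in my mind. Two days had already gone by, and I was terrified that if restarting the application did not fix it, my whole line of reasoning would fall apart.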

First thing on the third morning I rushed to ask the developer. The good-natured young guy replied: "Bro, restarting Tuxedo worked."

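Suddenly the way ahead looked bright, and I felt I had pretty much nailed it. But Baidu and Metalink had nothing more to offer, so I remembered Bing, which I had not used in ages, found its international version, searched for ORA-04061: existing state of has been invalidated, and turned up a few very useful articles: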

First: (this poster hit the same problem as me, got ORA-04061, and was just as puzzled; the error did not occur every time)

Can somebody tell why these errors are coming sometimes, though this package (and its dependent packages) are valid.
Its not appearing each time, it just pops up some time.
----------------------------------------------------------------------
ORA-04061: existing state of has been invalidated 
ORA-04061: existing state of package "CHECK_DATA" has been invalidated 
ORA-04065: not executed, altered or dropped package "CHECK_DATA" 
ORA-06508: PL/SQL: could not find prog 

-------------------------------------------------------------------------

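Second: (a bit long; the gist is that WebLogic keeps long-lived connections open, another user ran DDL on the related objects, so the existing old connections held stale information; the fix is to reset the connections)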

ORA-04061: existing state of package "HR.EMPLOYEES_API" has been invalidated
ORA-04065: not executed, altered or dropped package "HR.EMPLOYEES_API"
ORA-06508: PL/SQL: could not find program unit being called: "HR.EMPLOYEES_API"
But when I check the state of package in database, its in valid state and it worked without any issues, when I try to execute the same program unit from database.
Considering above analysis, I concluded that there is no issue with database program unit. Then I started looking into SOA Web service and Weblogic server data source configuration.
Further investigation on SOA web service testing reveals that, the database program unit is getting invoked sometimes without any issues.
During my R&D, I did resetting data source connection pool. From then, I did not faced this issue.
Final Report:
This issue raises, when the database program being altered/created/compiled by another database user and SOA web service tries to invoke it using old data source connection pool i.e. created before altering database program.
Here comes the Weblogic data source connection pooling mechanism, which holds few logical connections created before I alter the database program with different user.
When I reset the data source, it will destroy the existing connections and will re-create them with current state of the database programs and resolves the issue.
Steps to Reset Data Source in WLS:
Go to the Weblogic Console > data sources and select your data source. For Ex: LocalXEDS
Go to the tab ‘Control’ and select the servers to which the data source is targeted and click ‘Reset’.
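The WebLogic "Reset" described above amounts to throwing away every pooled connection and opening new ones. The minimal Python sketch below models that with a hypothetical toy pool, not any real WebLogic or Tuxedo API. It also shows a narrower alternative, which is my own illustration rather than something the quoted article describes: replace only the connection that hit the stale-state error and retry once.

```python
class StaleState(Exception):
    """Toy stand-in for ORA-04061 on a connection whose package state is outdated."""


class Database:
    def __init__(self):
        self.version = 1  # bumped whenever the package body is recompiled

    def recompile(self):
        self.version += 1


class Connection:
    def __init__(self, db):
        self.db = db
        # Simplification: package state is captured when the connection opens
        # (Oracle instantiates it on the session's first call to the package).
        self.seen = db.version

    def call(self):
        if self.seen != self.db.version:
            raise StaleState("existing state of package has been invalidated")
        return "ok"


class Pool:
    """Hypothetical fixed-size pool of long-lived connections."""

    def __init__(self, db, size):
        self.db = db
        self.size = size
        self.conns = [Connection(db) for _ in range(size)]

    def reset(self):
        # Same idea as the data source 'Reset': drop every connection, open new ones.
        self.conns = [Connection(self.db) for _ in range(self.size)]

    def call(self, i):
        # Narrower option: replace only the connection that hit stale state, retry once.
        try:
            return self.conns[i].call()
        except StaleState:
            self.conns[i] = Connection(self.db)
            return self.conns[i].call()


if __name__ == "__main__":
    db = Database()
    pool = Pool(db, size=2)
    db.recompile()                          # DDL by another user or the batch job
    print(pool.call(0))                     # ok: connection 0 was replaced and retried
    pool.reset()
    print([c.call() for c in pool.conns])   # ['ok', 'ok']
```

The full reset is the blunt tool, since it also discards healthy connections. The evict-and-retry variant is cheaper but assumes the call is safe to repeat, which a ledger posting is only if the failed attempt left nothing committed.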

------------------------------------------------------------------------------------------------------------------------------------------------------

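At this point the problem was essentially confirmed. I also consulted an Oracle middleware engineer remotely. He confirmed that this situation does occur, but said there is no good way to solve it at the middleware level and it has to be handled on the database side.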

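Once the cause was known, fixing it was relatively simple. On the database side, one option is to have the program call DBMS_SESSION to reinitialize the package state, but it is better solved at the application level. I discussed it with the developers, and the plan is to fix the problem by changing the calling program.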
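For the DBMS_SESSION option, Oracle's DBMS_SESSION package provides RESET_PACKAGE and MODIFY_PACKAGE_STATE (with the REINITIALIZE flag) for discarding a session's package state. The sketch below shows one way an application could use it only when needed: wrap the business call, and if it fails with a stale-state error number, run the reset as a separate statement on the same session and retry once. DbError, the run_call/run_sql callables, and FakeSession are hypothetical placeholders for whatever driver the application actually uses; only the PL/SQL in RESET_SQL is real Oracle API, and whether it clears the error in a given setup should be verified by testing.

```python
# Real Oracle API: reinitializes all package state in the current session.
RESET_SQL = "BEGIN DBMS_SESSION.MODIFY_PACKAGE_STATE(DBMS_SESSION.REINITIALIZE); END;"

# ORA-04061 (existing state invalidated) and ORA-04068 (existing state discarded).
STALE_STATE_CODES = {4061, 4068}


class DbError(Exception):
    """Hypothetical driver exception carrying the Oracle error number."""

    def __init__(self, code, message=""):
        super().__init__(f"ORA-{code:05d}: {message}")
        self.code = code


def call_with_state_reset(run_call, run_sql):
    """Run run_call(); on a stale-package-state error, reset package state once and retry.

    run_call and run_sql are hypothetical callables bound to the same database session.
    Any other error is re-raised unchanged.
    """
    try:
        return run_call()
    except DbError as err:
        if err.code not in STALE_STATE_CODES:
            raise
        run_sql(RESET_SQL)  # issued as its own statement, before the retry
        return run_call()


class FakeSession:
    """Toy session: fails with ORA-04061 until its package state is reset."""

    def __init__(self):
        self.stale = True
        self.executed = []

    def run_sql(self, sql):
        self.executed.append(sql)
        if "DBMS_SESSION" in sql:
            self.stale = False

    def run_call(self):
        if self.stale:
            raise DbError(4061, "existing state of package has been invalidated")
        return "ok"


if __name__ == "__main__":
    s = FakeSession()
    print(call_with_state_reset(s.run_call, s.run_sql))  # ok
    print(s.executed == [RESET_SQL])                     # True
```

Resetting only when the error appears, rather than before every call, avoids wiping package variables the code legitimately relies on. As with any retry, it is only safe if the failed attempt had no side effects.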

What a relief.
