Case Study: Analyzing a General Protection Fault Thread Dump on IBM AIX

Problem Description

One node in a customer's Oracle Agile PLM server cluster crashed suddenly, and the javacore (thread dump) it left behind recorded a GPF error.

Problem Analysis

The system environment was as follows:

  • OS: AIX 5.3, 64-bit
  • Middleware: WebLogic 10.3.0.0
  • JDK: IBM JDK pap6460sr5-20090529_04 (SR5)
  • App: Agile PLM 9.3.0, 9.3.0.1

The javacore header shows that the dump was triggered by a GPF, that is, a General Protection Fault:

1TISIGINFO     Dump Event "gpf" (00002000) received 

Scanning the whole javacore file turns up a thread with state:B, that is, a blocked thread, whose ID is 0x0000000119092300.

3XMTHREADINFO      "[ACTIVE] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" TID:0x0000000119092300, j9thread_t:0x0000000118BD6420, state:B, prio=5
3XMTHREADINFO1            (native thread ID:0x1A504F, native priority:0x5, native policy:UNKNOWN)
4XESTACKTRACE          at com/agile/pc/cmserver/list/ListCache.getList(ListCache.java:72(Compiled Code))
4XESTACKTRACE          at com/agile/pc/cmserver/list/ListCache.initMap(ListCache.java:102(Compiled Code))
4XESTACKTRACE          at com/agile/pc/cmserver/list/ListCache.getMap(ListCache.java:90(Compiled Code))
4XESTACKTRACE          at com/agile/pc/cmserver/base/BaseUserGroupService.isOrganizationID(BaseUserGroupService.java:122(Compiled Code))
4XESTACKTRACE          at com/agile/pc/cmserver/base/CellValueBuilder.createUserValue(CellValueBuilder.java:427(Compiled Code))
4XESTACKTRACE          at com/agile/pc/cmserver/base/SQLHelper.getColumnValue(SQLHelper.java:625(Compiled Code))
...
...
4XESTACKTRACE          at weblogic/servlet/internal/ServletRequestImpl.run(Bytecode PC:208(Compiled Code))
4XESTACKTRACE          at weblogic/work/ExecuteThread.execute(ExecuteThread.java:201(Compiled Code))
4XESTACKTRACE          at weblogic/work/ExecuteThread.run(ExecuteThread.java:173)
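
On a large javacore, finding the blocked threads by eye is tedious. The following is a minimal sketch of how the search could be automated. It assumes that each 3XMTHREADINFO record sits on a single line, as it does in the javacore file itself, and that the fields appear in the order shown above. The class name `JavacoreThreadScan` and its method are made up for this illustration and are not part of any IBM tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JavacoreThreadScan {
    // Matches a thread header line; 3XMTHREADINFO1 lines do not match
    // because the tag must be followed directly by whitespace.
    private static final Pattern THREAD_HEADER = Pattern.compile(
        "^3XMTHREADINFO\\s+\"(.*)\"\\s+TID:(0x[0-9A-Fa-f]+),.*\\bstate:([A-Z]+)\\b");

    /** Returns "name TID" for every thread whose state code equals the given one. */
    static List<String> threadsInState(List<String> lines, String state) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = THREAD_HEADER.matcher(line);
            if (m.find() && m.group(3).equals(state)) {
                hits.add(m.group(1) + " " + m.group(2));
            }
        }
        return hits;
    }
}
```

Calling `threadsInState(lines, "B")` on the lines of the javacore would list the blocked threads, and passing `"CW"` would list the threads in condition wait.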

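Next, the Monitor Pool Dump section shows that this thread is "Waiting to enter" the monitor on com/agile/pc/cmserver/list/ListCache, and that the monitor is held, and has not been released, by another thread, "Thread-26", whose ID is 0x000000011B883800.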

3LKMONOBJECT       com/agile/pc/cmserver/list/ListCache@0x0700000073E01AE0/0x0700000073E01AF8: Flat locked by "Thread-26" (0x000000011B883800), entry count 1
3LKWAITERQ            Waiting to enter:
3LKWAITER                "[ACTIVE] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x0000000119092300)
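
The pattern in this monitor record, one thread holding a monitor while it waits on something slow and a second thread stuck trying to enter, is easy to reproduce in isolation. The sketch below is not Agile code. `CACHE_LOCK` is a stand-in for the ListCache monitor, a latch stands in for the slow work, and the thread names are borrowed from the dump only for readability.

```java
import java.util.concurrent.CountDownLatch;

class MonitorContentionDemo {
    private static final Object CACHE_LOCK = new Object(); // stand-in for the ListCache monitor

    /** Returns the state observed for the second thread while the first holds the monitor. */
    static Thread.State observeWaiterState() throws InterruptedException {
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch releaseHolder = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (CACHE_LOCK) {
                holderHasLock.countDown();
                try {
                    releaseHolder.await(); // stand-in for the slow work done under the lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "Thread-26");

        Thread waiter = new Thread(() -> {
            synchronized (CACHE_LOCK) {
                // Nothing to do here; the point is the wait to enter the monitor.
            }
        }, "ExecuteThread-4");

        holder.start();
        holderHasLock.await();   // make sure the holder owns the monitor first
        waiter.start();

        // Poll until the waiter is stuck on the monitor (bounded to about 5 seconds).
        Thread.State seen = waiter.getState();
        for (int i = 0; i < 500 && seen != Thread.State.BLOCKED; i++) {
            Thread.sleep(10);
            seen = waiter.getState();
        }

        releaseHolder.countDown(); // let both threads finish
        holder.join();
        waiter.join();
        return seen;
    }
}
```

While the holder is parked, the second thread reports `Thread.State.BLOCKED`, which corresponds to the state:B entry in the javacore.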

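Looking up this newly surfaced thread, 0x000000011B883800, shows something odd: it is in state CW (condition wait), that is, it is itself waiting rather than doing work.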

3XMTHREADINFO      "Thread-26" TID:0x000000011B883800, j9thread_t:0x00000001173F7660, state:CW, prio=1
3XMTHREADINFO1            (native thread ID:0x1AC015, native priority:0x1, native policy:UNKNOWN)
4XESTACKTRACE          at java/net/SocketInputStream.socketRead0(Native Method)
4XESTACKTRACE          at java/net/SocketInputStream.read(SocketInputStream.java:140(Compiled Code))
4XESTACKTRACE          at oracle/net/ns/Packet.receive(Bytecode PC:31(Compiled Code))
4XESTACKTRACE          at oracle/net/ns/DataPacket.receive(Bytecode PC:1(Compiled Code))
4XESTACKTRACE          at oracle/net/ns/NetInputStream.getNextPacket(Bytecode PC:1(Compiled Code))
4XESTACKTRACE          at oracle/net/ns/NetInputStream.read(Bytecode PC:33(Compiled Code))
4XESTACKTRACE          at oracle/net/ns/NetInputStream.read(Bytecode PC:5(Compiled Code))
4XESTACKTRACE          at oracle/net/ns/NetInputStream.read(Bytecode PC:5(Compiled Code))
4XESTACKTRACE          at oracle/jdbc/driver/T4CMAREngine.unmarshalUB1(T4CMAREngine.java:979(Compiled Code))
4XESTACKTRACE          at oracle/jdbc/driver/T4CMAREngine.unmarshalSB1(T4CMAREngine.java:951(Compiled Code))
4XESTACKTRACE          at oracle/jdbc/driver/T4C8Oall.receive(T4C8Oall.java:419(Compiled Code))
4XESTACKTRACE          at oracle/jdbc/driver/T4CPreparedStatement.doOall8(T4CPreparedStatement.java:185(Compiled Code))
4XESTACKTRACE          at oracle/jdbc/driver/T4CPreparedStatement.fetch(T4CPreparedStatement.java:692(Compiled Code))
4XESTACKTRACE          at oracle/jdbc/driver/OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:258(Compiled Code))
4XESTACKTRACE          at oracle/jdbc/driver/OracleResultSetImpl.next(OracleResultSetImpl.java:193(Compiled Code))
4XESTACKTRACE          at weblogic/jdbc/wrapper/ResultSet_oracle_jdbc_driver_OracleResultSetImpl.next(Bytecode PC:20(Compiled Code))
4XESTACKTRACE          at com/agile/pc/cmserver/usergroup/UserGroupDAO.retrieveAllGroups(UserGroupDAO.java:779(Compiled Code))
4XESTACKTRACE          at com/agile/pc/cmserver/list/ListCache.loadUserGroupList(ListCache.java:316)
4XESTACKTRACE          at com/agile/pc/cmserver/list/ListCache.loadList(ListCache.java:201)

Digging Deeper

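At the Oracle JDBC thin driver layer, the thread is waiting in socketRead0 rather than timing out: the socket has not timed out, and neither has the WebLogic transaction. There are roughly two possible causes: a bug in the JDBC driver, or a busy query (or a query that runs normally but whose results take too long to process, so that the JDBC transaction has not yet been closed).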
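
To illustrate the difference between a socket read that waits indefinitely and one that times out, here is a small self-contained sketch using plain `java.net` sockets. It does not involve the Oracle driver at all. A local server socket that never sends data plays the role of a database that is slow to answer, and `SocketReadTimeoutDemo` is a made-up name. At the JDBC level, the standard `java.sql.Statement.setQueryTimeout` method serves a related purpose. Whether a timeout would have been appropriate in this customer's case is a separate question.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SocketReadTimeoutDemo {
    /** Returns true if a read on a silent connection ends with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The server never calls accept() or writes; the OS backlog completes the connect.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // 0 would mean "wait forever"
            InputStream in = client.getInputStream();
            try {
                in.read(); // with no timeout set, this call would block indefinitely
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }
}
```

With a nonzero `SO_TIMEOUT` the blocked read surfaces as an exception that the caller can handle, instead of a thread that sits in a native read while holding whatever locks it acquired earlier.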

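Reading further down this thread's stack, one method stands out as likely to be quite time-consuming, because the customer has a very large amount of UserGroup data: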

com/agile/pc/cmserver/usergroup/UserGroupDAO.retrieveAllGroups

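That locates the problem: Thread-26 holds the ListCache monitor while retrieveAllGroups reads a large result set, and ExecuteThread '4' is blocked behind it.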
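
One common way to keep readers from queuing behind a slow cache load is to build the new data without holding any lock and then publish it in a single step. The sketch below shows that general idea only. The source of ListCache is not available here, so `SnapshotCache`, its methods, and the `String` key and value types are all hypothetical and do not describe how Agile PLM works or how it was fixed.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class SnapshotCache {
    // Readers see either the old or the new snapshot, never a half-built one.
    private volatile Map<String, String> snapshot = Collections.emptyMap();

    /** Lock-free read: never waits for a reload that is in progress. */
    String get(String key) {
        return snapshot.get(key);
    }

    /** Runs the (possibly slow) loader first, then publishes the result in one assignment. */
    void reload(Supplier<Map<String, String>> slowLoader) {
        Map<String, String> fresh = new HashMap<>(slowLoader.get()); // no lock held here
        snapshot = Collections.unmodifiableMap(fresh);
    }
}
```

The trade-off is that readers may see slightly stale data while a reload is running, which is often acceptable for reference lists such as user groups.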

Attachment

Thread dump file: javacore.20100606.115449.544804.0006.zip
