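One node in a customer's Oracle Agile PLM server cluster crashed suddenly, and the javacore (thread dump) recorded a GPF error.
The system environment is as follows:
The javacore header shows GPF, i.e. General Protection Fault: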
1TISIGINFO Dump Event "gpf" (00002000) received
Scanning the whole javacore file, one thread stands out with state:B, i.e. a blocked thread, with ID 0x0000000119092300.
3XMTHREADINFO "[ACTIVE] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" TID:0x0000000119092300, j9thread_t:0x0000000118BD6420, state:B, prio=5
3XMTHREADINFO1 (native thread ID:0x1A504F, native priority:0x5, native policy:UNKNOWN)
4XESTACKTRACE at com/agile/pc/cmserver/list/ListCache.getList (ListCache.java:72(Compiled Code))
4XESTACKTRACE at com/agile/pc/cmserver/list/ListCache.initMap (ListCache.java:102(Compiled Code))
4XESTACKTRACE at com/agile/pc/cmserver/list/ListCache.getMap (ListCache.java:90(Compiled Code))
4XESTACKTRACE at com/agile/pc/cmserver/base/BaseUserGroupService.isOrganizationID (BaseUserGroupService.java:122(Compiled Code))
4XESTACKTRACE at com/agile/pc/cmserver/base/CellValueBuilder.createUserValue (CellValueBuilder.java:427(Compiled Code))
4XESTACKTRACE at com/agile/pc/cmserver/base/SQLHelper.getColumnValue (SQLHelper.java:625(Compiled Code))
... ...
4XESTACKTRACE at weblogic/servlet/internal/ServletRequestImpl.run (Bytecode PC:208(Compiled Code))
4XESTACKTRACE at weblogic/work/ExecuteThread.execute (ExecuteThread.java:201(Compiled Code))
4XESTACKTRACE at weblogic/work/ExecuteThread.run (ExecuteThread.java:173)
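Finding blocked threads by eye works for a small dump, but the same scan can be scripted. The following is a minimal sketch, assuming the javacore has one record per line and that 3XMTHREADINFO records carry the name, TID and state fields in the order shown in the excerpt above; the class name BlockedThreadScan and its method are hypothetical helpers, not part of any IBM tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockedThreadScan {
    // Captures thread name, TID and state from a 3XMTHREADINFO record.
    // The trailing \s+ after the tag keeps 3XMTHREADINFO1 records from matching.
    private static final Pattern THREAD_INFO = Pattern.compile(
        "^3XMTHREADINFO\\s+\"(.*)\"\\s+TID:(0x[0-9A-Fa-f]+),.*\\bstate:(\\w+)");

    /** Returns "TID name" for every thread record whose state is B (blocked). */
    static List<String> blockedThreads(List<String> javacoreLines) {
        List<String> result = new ArrayList<>();
        for (String line : javacoreLines) {
            Matcher m = THREAD_INFO.matcher(line);
            if (m.find() && "B".equals(m.group(3))) {
                result.add(m.group(2) + " " + m.group(1));
            }
        }
        return result;
    }
}
```

Feeding it the lines of the javacore (for example via java.nio.file.Files.readAllLines) yields the TIDs to look up next in the monitor section.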
Turning to the Monitor Pool Dump section, this thread is listed under "Waiting to enter": the monitor it needs is held, and has not been released, by another thread, "Thread-26" (0x000000011B883800).
3LKMONOBJECT com/agile/pc/cmserver/list/ListCache@0x0700000073E01AE0/0x0700000073E01AF8: Flat locked by "Thread-26" (0x000000011B883800), entry count 1
3LKWAITERQ Waiting to enter:
3LKWAITER "[ACTIVE] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x0000000119092300)
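The owner/waiter relationship recorded in the monitor dump can be reproduced in a few lines of plain Java. This is only an illustrative sketch, not Agile code: CACHE_LOCK stands in for the ListCache monitor, the "owner" thread plays the part of Thread-26, and the "waiter" thread plays the part of ExecuteThread '4'. The owner holds the monitor while it waits on something else, and the waiter ends up in Thread.State.BLOCKED, which is what a thread dump reports as a blocked thread.

```java
import java.util.concurrent.CountDownLatch;

final class MonitorContentionDemo {
    /** Returns the state of a thread that tries to enter a monitor another thread holds. */
    static Thread.State observeWaiterState() {
        final Object CACHE_LOCK = new Object();          // stand-in for the ListCache monitor
        final CountDownLatch ownerHasLock = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread owner = new Thread(() -> {
            synchronized (CACHE_LOCK) {
                ownerHasLock.countDown();
                try {
                    release.await();                     // waits while still holding the monitor
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "owner");
        Thread waiter = new Thread(() -> {
            synchronized (CACHE_LOCK) {
                // nothing to do; entering the monitor is the point
            }
        }, "waiter");

        try {
            owner.start();
            ownerHasLock.await();                        // make sure the owner really has the lock
            waiter.start();
            long deadline = System.nanoTime() + 5_000_000_000L;
            Thread.State state = waiter.getState();
            while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(10);
                state = waiter.getState();
            }
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();                         // let both threads finish
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state: " + observeWaiterState());
    }
}
```

The waiter cannot make progress until the owner leaves the synchronized block, however long the owner's own wait takes.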
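Next, look up this newly identified thread, 0x000000011B883800. Oddly, it is in state CW (condition wait), meaning it is itself waiting of its own accord.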
3XMTHREADINFO "Thread-26" TID:0x000000011B883800, j9thread_t:0x00000001173F7660, state:CW, prio=1
3XMTHREADINFO1 (native thread ID:0x1AC015, native priority:0x1, native policy:UNKNOWN)
4XESTACKTRACE at java/net/SocketInputStream.socketRead0(Native Method)
4XESTACKTRACE at java/net/SocketInputStream.read(SocketInputStream.java:140(Compiled Code))
4XESTACKTRACE at oracle/net/ns/Packet.receive(Bytecode PC:31(Compiled Code))
4XESTACKTRACE at oracle/net/ns/DataPacket.receive(Bytecode PC:1(Compiled Code))
4XESTACKTRACE at oracle/net/ns/NetInputStream.getNextPacket(Bytecode PC:1(Compiled Code))
4XESTACKTRACE at oracle/net/ns/NetInputStream.read(Bytecode PC:33(Compiled Code))
4XESTACKTRACE at oracle/net/ns/NetInputStream.read(Bytecode PC:5(Compiled Code))
4XESTACKTRACE at oracle/net/ns/NetInputStream.read(Bytecode PC:5(Compiled Code))
4XESTACKTRACE at oracle/jdbc/driver/T4CMAREngine.unmarshalUB1(T4CMAREngine.java:979(Compiled Code))
4XESTACKTRACE at oracle/jdbc/driver/T4CMAREngine.unmarshalSB1(T4CMAREngine.java:951(Compiled Code))
4XESTACKTRACE at oracle/jdbc/driver/T4C8Oall.receive(T4C8Oall.java:419(Compiled Code))
4XESTACKTRACE at oracle/jdbc/driver/T4CPreparedStatement.doOall8(T4CPreparedStatement.java:185(Compiled Code))
4XESTACKTRACE at oracle/jdbc/driver/T4CPreparedStatement.fetch(T4CPreparedStatement.java:692(Compiled Code))
4XESTACKTRACE at oracle/jdbc/driver/OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:258(Compiled Code))
4XESTACKTRACE at oracle/jdbc/driver/OracleResultSetImpl.next(OracleResultSetImpl.java:193(Compiled Code))
4XESTACKTRACE at weblogic/jdbc/wrapper/ResultSet_oracle_jdbc_driver_OracleResultSetImpl.next(Bytecode PC:20(Compiled Code))
4XESTACKTRACE at com/agile/pc/cmserver/usergroup/UserGroupDAO.retrieveAllGroups(UserGroupDAO.java:779(Compiled Code))
4XESTACKTRACE at com/agile/pc/cmserver/list/ListCache.loadUserGroupList(ListCache.java:316)
4XESTACKTRACE at com/agile/pc/cmserver/list/ListCache.loadList(ListCache.java:201)
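The Oracle JDBC thin driver is waiting in socketRead0 rather than timing out: neither the socket nor the WebLogic transaction has timed out. There are broadly two possible causes: a bug in the JDBC driver, or a busy query (or the query itself is fine but processing its results is too slow, so the JDBC transaction has not yet been closed).
Reading further down this thread's stack, one method looks likely to be quite expensive, because the customer has too much UserGroup data: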
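Since the read never timed out, one thing worth checking is whether any timeout was configured at all. The sketch below shows two knobs that exist for this purpose: the Oracle thin driver connection property oracle.jdbc.ReadTimeout (a socket read timeout in milliseconds) and the standard JDBC Statement.setQueryTimeout (in seconds). The class name, method names and values are hypothetical, and whether the property is honored depends on the driver version in use; in a WebLogic deployment it would normally be set in the data source's connection properties rather than in code.

```java
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

final class JdbcTimeoutSketch {
    /** Builds connection properties that include a socket read timeout for the thin driver. */
    static Properties connectionProperties(String user, String password, int readTimeoutMillis) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Oracle thin-driver property: abort a blocked socket read after this many milliseconds.
        props.setProperty("oracle.jdbc.ReadTimeout", Integer.toString(readTimeoutMillis));
        return props;
    }

    /** Standard JDBC: ask the driver to give up on this statement after the given seconds. */
    static void applyQueryTimeout(Statement stmt, int seconds) throws SQLException {
        stmt.setQueryTimeout(seconds);
    }
}
```

The resulting Properties object would be passed to DriverManager.getConnection(url, props). With a timeout in place, a stalled read surfaces as an SQLException instead of an indefinite wait while holding a lock.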
com/agile/pc/cmserver/usergroup/UserGroupDAO.retrieveAllGroups
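Problem found: Thread-26 holds the ListCache monitor while it loads the user group list, so any other thread that needs ListCache is blocked until the load finishes.
Thread dump file: javacore.20100606.115449.544804.0006.zip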
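A common way to avoid this kind of contention is to run the slow load without holding any lock that readers need, and publish the finished result in one step. The following is a generic sketch of that idea, not Agile's actual ListCache implementation (which is not shown in the dump); SnapshotListCache and slowLoader are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class SnapshotListCache<T> {
    // Readers always see a complete, immutable snapshot; initially empty.
    private final AtomicReference<List<T>> snapshot =
        new AtomicReference<>(Collections.<T>emptyList());
    private final Supplier<List<T>> slowLoader;   // e.g. a database query

    SnapshotListCache(Supplier<List<T>> slowLoader) {
        this.slowLoader = slowLoader;
    }

    /** Never blocks: returns the most recently published snapshot. */
    List<T> getList() {
        return snapshot.get();
    }

    /** Runs the slow load with no lock held, then swaps the result in atomically. */
    void reload() {
        List<T> fresh = Collections.unmodifiableList(new ArrayList<>(slowLoader.get()));
        snapshot.set(fresh);
    }
}
```

The trade-off is that readers may see an empty or stale list while a reload is in progress, which is acceptable only if the callers can tolerate slightly out-of-date data.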