Analysis of an RPC Service Call Failure

Symptom: RPC requests kept timing out with the following exception
java.lang.RuntimeException: xxxRpcTimeOutException-null

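Analysis: judging from the code, the timeout was probably caused by some SQL statement or a third-party service call. But the business logs of the service contained no exception entries at all, which was awkward.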

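1. Machine load metrics were normal. To restore the online service, the cluster was restarted, with one machine kept aside for analysis. That machine was first taken out of the production environment so that online traffic would not reach it.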

2. Since no exception had been logged, check whether jstack shows anything useful

jps -v 
jstack 1893 |grep '${ClassName.methodName}'

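Nothing relevant was found; the thread may already have finished. After manually invoking the RPC service once and repeating step 2, the following stack trace appeared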
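The same kind of search can also be done from inside the JVM when attaching jstack is inconvenient. Below is a minimal sketch using the standard Thread.getAllStackTraces() call; the class name StackSearch and the helper threadsIn are hypothetical names chosen for illustration, not part of the original service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: an in-process equivalent of `jstack <pid> | grep Class.method`.
final class StackSearch {

    /** Returns "name [state]" for every live thread whose stack contains className.methodName. */
    static List<String> threadsIn(String className, String methodName) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : e.getValue()) {
                if (frame.getClassName().equals(className)
                        && frame.getMethodName().equals(methodName)) {
                    hits.add(e.getKey().getName() + " [" + e.getKey().getState() + "]");
                    break; // one hit per thread is enough
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The calling thread is inside StackSearch.main right now, so it should be listed.
        System.out.println(threadsIn("StackSearch", "main"));
    }
}
```

As with jstack, the snapshot only shows threads that are inside the method at the moment of the call, which is why the request had to be triggered again before anything showed up.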

"RpcBizProcessor-DEFAULT-8-thread-1" daemon prio=10 tid=0x000000000187b800 nid=0x1d86 runnable [0x000000005fccd000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at oracle.net.ns.Packet.receive(Unknown Source)
        at oracle.net.ns.DataPacket.receive(Unknown Source)
        at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
        at oracle.net.ns.NetInputStream.read(Unknown Source)
        at oracle.net.ns.NetInputStream.read(Unknown Source)
        at oracle.net.ns.NetInputStream.read(Unknown Source)
        at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1104)
        at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:1075)
        at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:480)
        at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:216)
        at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:966)
        at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1170)
        at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3339)
        at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:3423)
        - locked <0x00000007a2a87d58> (a oracle.jdbc.driver.T4CPreparedStatement)
        - locked <0x00000007a0a9bd90> (a oracle.jdbc.driver.T4CConnection)
        at com.alibaba.druid.filter.FilterChainImpl.preparedStatement_executeUpdate(FilterChainImpl.java:2721)
        at com.alibaba.druid.filter.FilterAdapter.preparedStatement_executeUpdate(FilterAdapter.java:1069)
        at com.alibaba.druid.filter.FilterEventAdapter.preparedStatement_executeUpdate(FilterEventAdapter.java:491)
        at com.alibaba.druid.filter.FilterChainImpl.preparedStatement_executeUpdate(FilterChainImpl.java:2719)
        at com.alibaba.druid.proxy.jdbc.PreparedStatementProxyImpl.executeUpdate(PreparedStatementProxyImpl.java:145)
        at com.alibaba.druid.pool.DruidPooledPreparedStatement.executeUpdate(DruidPooledPreparedStatement.java:253)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
        at com.alibaba.alimonitor.jmonitor.plugin.spring.JMonitorMethodInterceptor.invoke(JMonitorMethodInterceptor.java:56)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
        at com.sun.proxy.$Proxy567.unbindingManyObject(Unknown Source)
        at com.ali.caesar.platform.common.dubbo.service.TagBindingDubboService.unbindingManyObject(TagBindingDubboService.java:102)

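The thread is blocked in a socket read, waiting for the database to return the result of the statement, and the wait turned out to be fairly long. At this point it is clear that the problem comes from the SQL, possibly a SQL performance issue or something similar.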

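3. The next step is to find the sql_id of the statement that timed out and check whether it was blocked. Use the jstack call stack to identify the offending SQL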

SELECT * FROM  v$sqlarea  where SQL_FULLTEXT like '%sql table name%'
Look up the session history by sql_id. Many rows have a non-null BLOCKING_SESSION, so blocking definitely occurred.
-- Query for historical blocking SQL
SELECT * FROM v$active_session_history his
join v$sqlarea sa on sa.sql_id = his.sql_id
  and his.sample_time > TO_DATE('2018-05-24 09:49:00','yyyy-MM-dd HH24:mi:ss')
  and his.sample_time < TO_DATE('2018-05-24 09:50:00','yyyy-MM-dd HH24:mi:ss')
  and his.machine = 'machine name'
LEFT JOIN v$session_wait st on st.sid = his.session_id
ORDER BY his.sample_time desc

4. Why would this SQL block?

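Is an index missing? No, the index is there, and a test update against the database takes only a few milliseconds, so it should not time out. Another dead end.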

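5. With SQL performance ruled out, go back to the jstack call stack. It contains AopProxy-style frames, so could it be that a Spring transaction has not been committed?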

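A look at the aspect configuration was a shock: the transaction aspect was applied at the Rpc class layer, and the Rpc method contains a for loop.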

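So the SQL blocking was caused by transactions not being committed promptly. After changing the transaction configuration to the finest granularity, everything went back to normal. Never configure transactions carelessly!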
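To make the effect of transaction scope concrete, here is a toy model rather than the real Spring configuration. It assumes each loop iteration issues an UPDATE that takes a row lock held until commit; FakeTx, coarse and fine are hypothetical names introduced only for this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy simulation only: no database or Spring involved.
final class TxScopeSketch {

    /** Hypothetical stand-in for a transaction: row locks are kept until commit(). */
    static final class FakeTx {
        private final Set<Integer> lockedRows = new HashSet<>();
        int maxLocksHeld = 0;

        void update(int rowId) {
            lockedRows.add(rowId); // assumed: an UPDATE takes a row lock
            maxLocksHeld = Math.max(maxLocksHeld, lockedRows.size());
        }

        void commit() {
            lockedRows.clear();    // locks are released only here
        }
    }

    /** Coarse scope: one transaction wraps the whole loop, as in the faulty setup. */
    static int coarse(List<Integer> rowIds) {
        FakeTx tx = new FakeTx();
        for (int id : rowIds) {
            tx.update(id);
        }
        tx.commit();
        return tx.maxLocksHeld;
    }

    /** Fine scope: every iteration runs and commits in its own transaction. */
    static int fine(List<Integer> rowIds) {
        int max = 0;
        for (int id : rowIds) {
            FakeTx tx = new FakeTx();
            tx.update(id);
            tx.commit();
            max = Math.max(max, tx.maxLocksHeld);
        }
        return max;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println("coarse scope, max locks held: " + coarse(ids)); // 5
        System.out.println("fine scope, max locks held:   " + fine(ids));   // 1
    }
}
```

In the coarse version every row touched so far stays locked until the loop ends, so a concurrent request updating any of those rows has to wait, which matches the blocking seen in v$active_session_history.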

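6. One question remains: why was no business exception stack logged during the RPC call? If the database had timed out, the Rpc layer should have been able to catch that exception.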

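The reason the Rpc layer caught no exception: the default RPC request timeout is 5 s, and the blocked database call kept waiting for longer than that. After waiting 5 s, the Rpc service presumably force-terminated the task on that thread, while the blocked database call simply kept waiting and never threw an exception upward. As a result, the Rpc service had no call stack to log for that thread.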
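The sketch below shows why a caller-side timeout leaves the worker with nothing to log. It is a simplified model, not the actual RPC framework: a CountDownLatch stands in for the blocked database read, Future.get with a timeout stands in for the RPC timeout, and the name CallerTimeoutSketch is hypothetical. Whether the real framework also cancels the worker task is not shown in the source.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified model of "caller times out, worker keeps waiting silently".
final class CallerTimeoutSketch {

    /** Describes what the caller observed after waiting callerTimeoutMillis. */
    static String demo(long callerTimeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stands in for the blocked socket read on the database connection.
        final CountDownLatch dbReplied = new CountDownLatch(1);
        try {
            Future<String> call = pool.submit(() -> {
                dbReplied.await(); // blocked: no exception is raised on the worker side
                return "ok";
            });
            try {
                return "caller got " + call.get(callerTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Only the caller sees the timeout; the worker is still waiting.
                return "caller timed out, worker done=" + call.isDone();
            } catch (InterruptedException | ExecutionException e) {
                return "unexpected: " + e;
            }
        } finally {
            dbReplied.countDown(); // release the worker so the JVM can exit
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(100)); // caller timed out, worker done=false
    }
}
```

The timeout exception exists only on the caller side, which is consistent with the service showing no error in its business log while clients saw xxxRpcTimeOutException.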
