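I. A few words up front
Many people use logback to log exceptions. This post records one troubleshooting session that turned out to be long and winding. It is written down here for reference only.
II. The Feign call error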
2021-05-13 15:50:08.068 ERROR [xxx,2dc51443f8756306,0cfeb5fe0e739878,false] 445 --- [cTaskExecutor-1] c.x.e.call.service.CallUserService : 获取用户信息失败 [userIds=246529 e=RESTUserService#getUserClient(String,String,String) failed and no fallback available.]
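The first step in troubleshooting is to start from the error log above: it shows which method failed and the key input parameters (the Chinese text in the log line means "failed to get user info").
Note that Hystrix is used here as the circuit breaker. When this service called the user service, it hit a familiar error:
failed and no fallback available.
Note that this is not "timed-out and no fallback available." The call failed; it did not time out.
Some background first: the failing call is made inside a RabbitMQ message listener.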
@RabbitListener(queues = "xxx")
@RabbitHandler
public void pushMessage(String request) {
    // 1. Call the user service endpoint to look up user info
    // 2. Call the push service to send the user a reminder
}
III. Troubleshooting approach
1. Hystrix reported an error while calling the user service, and no fallback was configured.
@FeignClient(name = "user-service", url = "${user-service.url}")
public interface IUserService {
    @RequestMapping(value = "/api/v1/users/{userId}", method = RequestMethod.GET)
    UserInfoResult getUserInfo(@PathVariable("userId") Integer userId);

    // The element type is UserClientResponse (it appears in the stack trace further down)
    @RequestMapping(value = "/api/v1/users/client", method = RequestMethod.GET)
    List<UserClientResponse> getUserClient(@RequestParam("userIds") String userIds);
}
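Step 1 notes that no fallback was configured, which is why the caller only ever saw the generic wrapper message. In Spring Cloud OpenFeign, a fallback factory (the `fallbackFactory` attribute of `@FeignClient`) is handed the triggering exception, so it is a natural place to log the real cause. The sketch below mimics that idea in plain Java with no framework. `FallbackWithCause` and `callWithFallback` are hypothetical names for illustration and are not part of the original service.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

class FallbackWithCause {
    // Run the primary call; on a runtime failure, pass the exception to the fallback
    // so the fallback can log the real cause before returning a default value.
    static <T> T callWithFallback(Supplier<T> primary, Function<Throwable, T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.apply(e);
        }
    }

    public static void main(String[] args) {
        List<String> clients = callWithFallback(
                () -> {
                    // Stand-in for the remote call failing during response decoding
                    throw new IllegalStateException(
                            "JSON parse error: Numeric value (1620876433140) out of range of int");
                },
                cause -> {
                    // The cause is visible here even though the caller gets a default
                    System.err.println("getUserClient failed, cause: " + cause);
                    return Collections.emptyList();
                });
        System.out.println("result = " + clients);
    }
}
```

A fallback that returns a default value without logging the cause would hide the bug even more thoroughly, so the logging line is the important part of this sketch.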
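2. We took a detour here. The feedback from the business side was that calls to the user service endpoint "/api/v1/users/client" failed intermittently and affected only some users, while other endpoints of the same user service, such as "/api/v1/users/{userId}", worked fine.
The user service owner told us that the endpoint responds within a few hundred milliseconds, that it queries a MySQL database, that the user table holds a few hundred thousand rows, and that the queried fields are indexed.
3. We overlooked the source of the error message. Put differently, because the error was logged the wrong way, the log could not pinpoint the problem, so we fell back on investigating with whatever information the caller and the callee could give us.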
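4. Since the log did not reveal the cause, the next step was to verify that the service provider was healthy.
We turned to Pinpoint and confirmed that the endpoint responded within a few hundred milliseconds and that other consumers calling the same endpoint had no problems.
For HTTP calls, Pinpoint cannot tell which internal service made the request. And because this is an HTTP GET endpoint, neither side logs anything for it.
5. Next, to be sure the provider's endpoint was healthy, we also asked the user service owner to trace the SQL execution plan directly in the database, to check query speed and whether the index was used. The input was the userIds printed by the business side.
That was also very fast, since the data volume is small to begin with.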
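6. After steps 4 and 5 we could conclude that the service provider was innocent, so we turned to analyzing why the business side was tripping Hystrix. Unfortunately the business side's logging was poor and gave too little information. Worse, the business side could not tell which calls succeeded, which failed, or how often the endpoint was being called.
7. The endpoint was not slow and the provider was available, so could the caller be calling too frequently? We have Prometheus for application monitoring, but it only covers HTTP requests, not MQ traffic. (The failing call is triggered from an MQ message listener.) So we could not see whether our own service was calling this user service endpoint too often.
8. Our line of thought stalled here, so we searched the web: what can cause "failed and no fallback available."? Whatever it was, it had to involve Hystrix. If only we had Hystrix monitoring, we could see which Hystrix events were being fired. Fortunately, the business side had Hystrix monitoring hooked up.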
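Here is one of its monitoring dashboards:
[Figure: Hystrix monitoring dashboard]
Since monitoring was available, we also looked at the endpoint's execution latency:
[Figure: endpoint execution latency]
As noted earlier, this is an internal network call and the user service endpoint responds quickly. The latency seen here was only tens of milliseconds.
Looking at the events in detail, the calls did not open the circuit breaker, and there were no thread pool rejection events either. The screen was full of failure events.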
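9. Comparing once more with the other endpoints, which were called successfully, we could conclude at this point that something was wrong with how the Feign call for this particular endpoint was written. Don't forget that Java is a strongly typed language through and through.
With no other option, that evening I read through the code on both sides. Unfortunately my eyes let me down: I compared the field names, and they all matched. (Keep this in mind; it turns out to be the trap.)
10. The next day, after some discussion, and since monitoring had narrowed the problem down to this method, we picked up another weapon: Alibaba's Arthas. The tool is so powerful that we first had to decide what to use it for. As noted at the very beginning, the exception was being caught and only its message was logged. What we wanted was the full exception stack trace, which would show where the problem was.
11. So what we need to watch is the exception together with the method's argument and return values. In the command below, -e makes watch fire only when the method throws, -x 4 expands the printed objects four levels deep, and -n 20 stops after 20 hits.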
watch com.xxx.call.IUserService getUserClient "{params,returnObj,throwExp}" -e -x 4 -n 20
After watching for a while, we captured the detailed error, shown below.
>
Press Q or Ctrl+C to abort.
Affect(class count: 2 , method count: 1) cost in 494 ms, listenerId: 2
method=com.sun.proxy.$Proxy214.getUserClient location=AtExceptionExit
ts=2021-05-13 14:38:30; [cost=15.658889ms] result=@ArrayList[
@Object[][
null,
@String[TT],
@String[300054],
],
null,
com.netflix.hystrix.exception.HystrixRuntimeException: RESTUserService#getUserClient(String,String,String) failed and no fallback available.
at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:819)
at com.netflix.hystrix.AbstractCommand$22.call(AbstractCommand.java:804)
at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at com.netflix.hystrix.AbstractCommand$DeprecatedOnFallbackHookApplication$1.onError(AbstractCommand.java:1472)
at com.netflix.hystrix.AbstractCommand$FallbackHookApplication$1.onError(AbstractCommand.java:1397)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:44)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:28)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:142)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$2.onError(AbstractCommand.java:1194)
at rx.internal.operators.OperatorSubscribeOn$SubscribeOnSubscriber.onError(OperatorSubscribeOn.java:80)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at com.netflix.hystrix.AbstractCommand$DeprecatedOnRunHookApplication$1.onError(AbstractCommand.java:1431)
at com.netflix.hystrix.AbstractCommand$ExecutionHookApplication$1.onError(AbstractCommand.java:1362)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:44)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:28)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10256)
at rx.internal.operators.OperatorSubscribeOn$SubscribeOnSubscriber.call(OperatorSubscribeOn.java:100)
at com.netflix.hystrix.strategy.concurrency.HystrixContexSchedulerAction$1.call(HystrixContexSchedulerAction.java:56)
at com.netflix.hystrix.strategy.concurrency.HystrixContexSchedulerAction$1.call(HystrixContexSchedulerAction.java:47)
at com.netflix.hystrix.strategy.concurrency.HystrixContexSchedulerAction.call(HystrixContexSchedulerAction.java:69)
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: feign.codec.DecodeException: Error while extracting response for type [java.util.List] and content type [application/json;charset=UTF-8]; nested exception is org.springframework.http.converter.HttpMessageNotReadableException: JSON parse error: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]; nested exception is com.fasterxml.jackson.databind.JsonMappingException: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]
at [Source: (PushbackInputStream); line: 1, column: 121] (through reference chain: java.util.ArrayList[0]->com.xxx.call.dto.response.UserClientResponse["registeredAt"])
at feign.SynchronousMethodHandler.decode(SynchronousMethodHandler.java:169)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:133)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:76)
at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java:108)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:46)
... 27 more
Caused by: org.springframework.web.client.RestClientException: Error while extracting response for type [java.util.List] and content type [application/json;charset=UTF-8]; nested exception is org.springframework.http.converter.HttpMessageNotReadableException: JSON parse error: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]; nested exception is com.fasterxml.jackson.databind.JsonMappingException: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]
at [Source: (PushbackInputStream); line: 1, column: 121] (through reference chain: java.util.ArrayList[0]->com.xxx.call.dto.response.UserClientResponse["registeredAt"])
at org.springframework.web.client.HttpMessageConverterExtractor.extractData(HttpMessageConverterExtractor.java:115)
at org.springframework.cloud.openfeign.support.SpringDecoder.decode(SpringDecoder.java:60)
at org.springframework.cloud.openfeign.support.ResponseEntityDecoder.decode(ResponseEntityDecoder.java:45)
at feign.optionals.OptionalDecoder.decode(OptionalDecoder.java:23)
at feign.SynchronousMethodHandler.decode(SynchronousMethodHandler.java:165)
... 33 more
Caused by: org.springframework.http.converter.HttpMessageNotReadableException: JSON parse error: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]; nested exception is com.fasterxml.jackson.databind.JsonMappingException: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]
at [Source: (PushbackInputStream); line: 1, column: 121] (through reference chain: java.util.ArrayList[0]->com.xxx.call.dto.response.UserClientResponse["registeredAt"])
at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.readJavaType(AbstractJackson2HttpMessageConverter.java:241)
at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.read(AbstractJackson2HttpMessageConverter.java:223)
at org.springframework.web.client.HttpMessageConverterExtractor.extractData(HttpMessageConverterExtractor.java:100)
... 37 more
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]
at [Source: (PushbackInputStream); line: 1, column: 121] (through reference chain: java.util.ArrayList[0]->com.xxx.call.dto.response.UserClientResponse["registeredAt"])
at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:391)
at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:351)
at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.wrapAndThrow(BeanDeserializerBase.java:1704)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:371)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:159)
at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:286)
at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:245)
at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:27)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4001)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3072)
at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.readJavaType(AbstractJackson2HttpMessageConverter.java:235)
... 39 more
Caused by: com.fasterxml.jackson.core.JsonParseException: Numeric value (1620876433140) out of range of int
at [Source: (PushbackInputStream); line: 1, column: 134]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1804)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:663)
at com.fasterxml.jackson.core.base.ParserBase.convertNumberToInt(ParserBase.java:869)
at com.fasterxml.jackson.core.base.ParserBase._parseIntValue(ParserBase.java:801)
at com.fasterxml.jackson.core.base.ParserBase.getIntValue(ParserBase.java:645)
at com.fasterxml.jackson.databind.deser.std.NumberDeserializers$IntegerDeserializer.deserialize(NumberDeserializers.java:472)
at com.fasterxml.jackson.databind.deser.std.NumberDeserializers$IntegerDeserializer.deserialize(NumberDeserializers.java:452)
at com.fasterxml.jackson.databind.deser.impl.FieldProperty.deserializeAndSet(FieldProperty.java:136)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:369)
... 46 more
,
IV. Summary
With the stack trace above, the problem should be clear to everyone. It points at the endpoint's return type, the class "com.xxx.call.dto.response.UserClientResponse":
JSON parse error: Numeric value (1620876433140) out of range of int
(through reference chain: java.util.ArrayList[0]->com.xxx.call.dto.response.UserClientResponse["registeredAt"])
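Comparing the type of registeredAt on the two sides settles it: the caller's DTO declares it as an int (the trace goes through Jackson's IntegerDeserializer), while the user service returns a long, so the value overflows the int range during deserialization.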
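To make the overflow concrete, here is a small sketch using only the JDK. It takes the value from the error message, checks it against the int range, and shows that strict narrowing fails. Treating the value as an epoch timestamp in milliseconds is an assumption on my part (read that way, it falls on 2021-05-13 UTC). The class name `RegisteredAtRange` and the helper `fitsInInt` are hypothetical.

```java
import java.time.Instant;

class RegisteredAtRange {
    // True when the value can be stored in a Java int without loss.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long registeredAt = 1620876433140L; // value copied from the error message

        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        System.out.println("fits in int? " + fitsInInt(registeredAt));

        // Assumption: the field holds an epoch timestamp in milliseconds.
        System.out.println("as instant: " + Instant.ofEpochMilli(registeredAt));

        try {
            // Strict narrowing conversion; throws instead of silently truncating
            Math.toIntExact(registeredAt);
        } catch (ArithmeticException e) {
            System.out.println("narrowing to int fails: " + e.getMessage());
        }
    }
}
```

Under that reading, the fix on the caller's side is presumably to declare registeredAt as a long (or Long) in UserClientResponse so that it matches what the user service sends.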
The hard lesson from this case: logging an exception without its stack trace can cost you dearly.
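The difference is easy to show. The sketch below uses only the JDK so that it runs without dependencies. It contrasts a log line built from `e.getMessage()` alone with one that renders the full trace, including the "Caused by:" chain. The class and method names are hypothetical, and the exceptions are stand-ins for the Hystrix and Jackson ones in the trace above. With SLF4J and logback, the equivalent is to pass the exception object itself as the last argument to `log.error(...)` instead of formatting its message into the text.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class StackTraceLogging {
    // Roughly what the original log line did: only the outermost message survives.
    static String messageOnly(Throwable e) {
        return "failed to get user info [e=" + e.getMessage() + "]";
    }

    // Renders the whole trace, including every "Caused by:" section.
    static String withStackTrace(Throwable e) {
        StringWriter buffer = new StringWriter();
        e.printStackTrace(new PrintWriter(buffer, true));
        return "failed to get user info\n" + buffer;
    }

    public static void main(String[] args) {
        // Stand-ins for the real exceptions: a decoding error wrapped by a generic failure
        Throwable root = new IllegalArgumentException(
                "Numeric value (1620876433140) out of range of int");
        Throwable wrapped = new RuntimeException(
                "getUserClient failed and no fallback available.", root);

        System.out.println(messageOnly(wrapped));    // root cause is lost
        System.out.println(withStackTrace(wrapped)); // root cause is visible
    }
}
```

Only the second form shows the "out of range of int" message, which is the one line that would have ended this investigation on day one.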
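Likewise, this investigation shows that monitoring is just as important: it let us narrow the problem down step by step and get closer to the truth.
V. Appendix: the Hystrix execution flow diagram
[Figure: Hystrix execution flow]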