警惕看不见的重试机制：为什么使用RPC必须考虑幂等性

在RPC场景中因为重试或者没有实现幂等性而导致的重复数据问题，必须引起大家重视，有可能会造成例如一次购买创建多笔订单，一条通知信息被发送多次等问题，这是技术人员必须面对和解决的问题。

有人可能会说：当调用失败时程序并没有显示重试，为什么还会产生重复数据问题呢？这是因为即使没有显示重试，RPC框架在集群容错机制中自动进行了重试，这个问题必须引起关注。

本文我们以DUBBO框架为例分析为什么重试，怎么做重试，怎么做幂等三个问题。

1 为什么重试

如果简单对一个RPC交互过程进行分类，可以分为三类：响应成功、响应失败、没有响应。

对于响应成功和响应失败两种情况，消费者很好处理。因为响应信息明确，所以只要根据响应信息，继续处理成功或者失败逻辑即可。但是没有响应这种场景比较难处理，这是因为没有响应可能包含以下情况：

(1) 生产者根本没有接收到请求

(2) 生产者接收到请求并且已处理成功，但是消费者没有接收到响应

(3) 生产者接收到请求并且已处理失败，但是消费者没有接收到响应

假设你是一名RPC框架设计者，究竟是选择重试还是放弃调用呢？其实最终如何选择取决于业务特性，有的业务本身就具有幂等性，但是有的业务不能允许重试否则会造成重复数据。

那么谁对业务特性最熟悉呢？答案是消费者，因为消费者作为调用方肯定最熟悉自身业务，所以RPC框架只要提供一些策略供消费者选择即可。

2，怎么做重试？

2.1 集群容错策略

DUBBO作为一款优秀RPC框架，提供了如下集群容错策略供消费者选择：

Failover：故障转移

Failfast：快速失败

Failsafe：安全失败

Failback：异步重试

Forking：并行调用

Broadcast：广播调用

(1) Failover

故障转移策略。作为默认策略，当消费发生异常时，通过负载均衡策略再选择一个生产者节点进行调用，直到达到重试次数

(2) Failfast

快速失败策略。消费者只消费一次服务，当发生异常时则直接抛出

(3) Failsafe

安全失败策略。消费者只消费一次服务，如果消费失败则包装一个空结果，不抛出异常

(4) Failback

异步重试策略。消费发生异常时返回一个空结果，失败请求将会进行异步重试。如果重试超过最大重试次数还不成功，放弃重试并不抛出异常

(5) Forking

并行调用策略。消费者通过线程池并发调用多个生产者，只要有一个成功就算成功

(6) Broadcast

广播调用策略。消费者遍历调用所有生产者节点，任何一个出现异常则抛出异常

2.2 源码分析

2.2.1 Failover

Failover故障转移策略作为默认策略，当消费发生异常时，通过负载均衡策略再选择一个生产者节点进行调用，直到达到重试次数。即使业务代码没有显示重试，也有可能多次执行消费逻辑从而造成重复数据：

public classFailoverClusterInvokerextendsAbstractClusterInvoker{

publicFailoverClusterInvoker(Directory directory){

super(directory);

}

@Override

publicResultdoInvoke(Invocation invocation,finalList> invokers, LoadBalance loadbalance)throwsRpcException{

// 所有生产者Invokers

List> copyInvokers = invokers;

checkInvokers(copyInvokers, invocation);

String methodName = RpcUtils.getMethodName(invocation);

// 获取重试次数

int len = getUrl().getMethodParameter(methodName, Constants.RETRIES_KEY, Constants.DEFAULT_RETRIES) + 1;

if (len <= 0) {

len = 1;

}

RpcException le = null;

// 已经调用过的生产者

List> invoked = new ArrayList>(copyInvokers.size());

Set providers = new HashSet(len);

// 重试直到达到最大次数

for (int i = 0; i < len; i++) {

if (i > 0) {

// 如果当前实例被销毁则抛出异常

checkWhetherDestroyed();

// 根据路由策略选出可用生产者Invokers

copyInvokers = list(invocation);

// 重新检查

checkInvokers(copyInvokers, invocation);

}

// 负载均衡选择一个生产者Invoker

Invoker invoker = select(loadbalance, invocation, copyInvokers, invoked);

invoked.add(invoker);

RpcContext.getContext().setInvokers((List) invoked);

try {

// 服务消费发起远程调用

Result result = invoker.invoke(invocation);

if (le != null && logger.isWarnEnabled()) {

logger.warn("Although retry the method " + methodName + " in the service " + getInterface().getName() + " was successful by the provider " + invoker.getUrl().getAddress() + ", but there have been failed providers " + providers + " (" + providers.size() + "/" + copyInvokers.size() + ") from the registry " + directory.getUrl().getAddress() + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version " + Version.getVersion() + ". Last error is: " + le.getMessage(), le);

}

// 有结果则返回

return result;

} catch (RpcException e) {

// 业务异常直接抛出

if (e.isBiz()) {

throw e;

}

le = e;

} catch (Throwable e) {

// RpcException不抛出继续重试

le = new RpcException(e.getMessage(), e);

} finally {

// 保存已经访问过的生产者

providers.add(invoker.getUrl().getAddress());

}

throw new RpcException(le.getCode(), "Failed to invoke the method " + methodName + " in the service " + getInterface().getName() + ". Tried " + len + " times of the providers " + providers + " (" + providers.size() + "/" + copyInvokers.size() + ") from the registry " + directory.getUrl().getAddress() + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version " + Version.getVersion() + ". Last error is: " + le.getMessage(), le.getCause() != null ? le.getCause() : le);

}

消费者调用生产者节点A发生RpcException异常时（例如超时异常），在未达到最大重试次数之前，消费者会通过负载均衡策略再次选择其它生产者节点消费。试想如果生产者节点A其实已经处理成功了，但是没有及时将成功结果返回给消费者，那么再次重试就可能造成重复数据问题。

2.2.2 Failfast

快速失败策略。消费者只消费一次服务，当发生异常时则直接抛出，不会进行重试：

public classFailfastClusterInvokerextendsAbstractClusterInvoker{

publicFailfastClusterInvoker(Directory directory){

super(directory);

}

@Override

publicResultdoInvoke(Invocation invocation, List> invokers, LoadBalance loadbalance)throwsRpcException{

// 检查生产者Invokers是否合法

checkInvokers(invokers, invocation);

// 负载均衡选择一个生产者Invoker

Invoker invoker = select(loadbalance, invocation, invokers, null);

try {

// 服务消费发起远程调用

return invoker.invoke(invocation);

} catch (Throwable e) {

// 服务消费失败不重试直接抛出异常

if (e instanceof RpcException && ((RpcException) e).isBiz()) {

throw (RpcException) e;

}

throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,

"Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()

+ " select from all providers " + invokers + " for service " + getInterface().getName()

+ " method " + invocation.getMethodName() + " on consumer " + NetUtils.getLocalHost()

+ " use dubbo version " + Version.getVersion()

+ ", but no luck to perform the invocation. Last error is: " + e.getMessage(),

e.getCause() != null ? e.getCause() : e);

}

2.2.3 Failsafe

安全失败策略。消费者只消费一次服务，如果消费失败则包装一个空结果，不抛出异常，不会进行重试：

public classFailsafeClusterInvokerextendsAbstractClusterInvoker{

private static final Logger logger = LoggerFactory.getLogger(FailsafeClusterInvoker.class);

publicFailsafeClusterInvoker(Directory directory){

super(directory);

}

@Override

publicResultdoInvoke(Invocation invocation, List> invokers, LoadBalance loadbalance)throwsRpcException{

try {

// 检查生产者Invokers是否合法

checkInvokers(invokers, invocation);

// 负载均衡选择一个生产者Invoker

Invoker invoker = select(loadbalance, invocation, invokers, null);

// 服务消费发起远程调用

return invoker.invoke(invocation);

} catch (Throwable e) {

// 消费失败包装为一个空结果对象

logger.error("Failsafe ignore exception: " + e.getMessage(), e);

return new RpcResult();

}

2.2.4 Failback

异步重试策略。消费发生异常时返回一个空结果，失败请求会进行异步重试。如果重试超过最大重试次数还不成功，放弃重试并不抛出异常：

public classFailbackClusterInvokerextendsAbstractClusterInvoker{

private static final Logger logger = LoggerFactory.getLogger(FailbackClusterInvoker.class);

private static final long RETRY_FAILED_PERIOD = 5;

private final int retries;

private final int failbackTasks;

private volatile Timer failTimer;

publicFailbackClusterInvoker(Directory directory){

super(directory);

int retriesConfig = getUrl().getParameter(Constants.RETRIES_KEY, Constants.DEFAULT_FAILBACK_TIMES);

if (retriesConfig <= 0) {

retriesConfig = Constants.DEFAULT_FAILBACK_TIMES;

}

int failbackTasksConfig = getUrl().getParameter(Constants.FAIL_BACK_TASKS_KEY, Constants.DEFAULT_FAILBACK_TASKS);

if (failbackTasksConfig <= 0) {

failbackTasksConfig = Constants.DEFAULT_FAILBACK_TASKS;

}

retries = retriesConfig;

failbackTasks = failbackTasksConfig;

}

privatevoidaddFailed(LoadBalance loadbalance, Invocation invocation, List> invokers, Invoker lastInvoker){

if (failTimer == null) {

synchronized (this) {

if (failTimer == null) {

// 创建定时器

failTimer = new HashedWheelTimer(new NamedThreadFactory("failback-cluster-timer", true), 1, TimeUnit.SECONDS, 32, failbackTasks);

}

// 构造定时任务

RetryTimerTask retryTimerTask = new RetryTimerTask(loadbalance, invocation, invokers, lastInvoker, retries, RETRY_FAILED_PERIOD);

try {

// 定时任务放入定时器等待执行

failTimer.newTimeout(retryTimerTask, RETRY_FAILED_PERIOD, TimeUnit.SECONDS);

} catch (Throwable e) {

logger.error("Failback background works error,invocation->" + invocation + ", exception: " + e.getMessage());

}

@Override

protectedResultdoInvoke(Invocation invocation, List> invokers, LoadBalance loadbalance)throwsRpcException{

Invoker invoker = null;

try {

// 检查生产者Invokers是否合法

checkInvokers(invokers, invocation);

// 负责均衡选择一个生产者Invoker

invoker = select(loadbalance, invocation, invokers, null);

// 消费服务发起远程调用

return invoker.invoke(invocation);

} catch (Throwable e) {

logger.error("Failback to invoke method " + invocation.getMethodName() + ", wait for retry in background. Ignored exception: " + e.getMessage() + ", ", e);

// 如果服务消费失败则记录失败请求

addFailed(loadbalance, invocation, invokers, invoker);

// 返回空结果

return new RpcResult();

}

@Override

publicvoiddestroy(){

super.destroy();

if (failTimer != null) {

failTimer.stop();

}

/**

* RetryTimerTask

private classRetryTimerTaskimplementsTimerTask{

private final Invocation invocation;

private final LoadBalance loadbalance;

private final List> invokers;

private final int retries;

private final long tick;

private Invoker lastInvoker;

private int retryTimes = 0;

RetryTimerTask(LoadBalance loadbalance, Invocation invocation, List> invokers, Invoker lastInvoker, int retries, long tick) {

this.loadbalance = loadbalance;

this.invocation = invocation;

this.invokers = invokers;

this.retries = retries;

this.tick = tick;

this.lastInvoker = lastInvoker;

}

@Override

publicvoidrun(Timeout timeout){

try {

// 负载均衡选择一个生产者Invoker

Invoker retryInvoker = select(loadbalance, invocation, invokers, Collections.singletonList(lastInvoker));

lastInvoker = retryInvoker;

// 服务消费发起远程调用

retryInvoker.invoke(invocation);

} catch (Throwable e) {

logger.error("Failed retry to invoke method " + invocation.getMethodName() + ", waiting again.", e);

// 超出最大重试次数记录日志不抛出异常

if ((++retryTimes) >= retries) {

logger.error("Failed retry times exceed threshold (" + retries + "), We have to abandon, invocation->" + invocation);

} else {

// 未超出最大重试次数重新放入定时器

rePut(timeout);

}

privatevoidrePut(Timeout timeout){

if (timeout == null) {

return;

}

Timer timer = timeout.timer();

if (timer.isStop() || timeout.isCancelled()) {

return;

}

timer.newTimeout(timeout.task(), tick, TimeUnit.SECONDS);

}

2.2.5 Forking

并行调用策略。消费者通过线程池并发调用多个生产者，只要有一个成功就算成功：

public classForkingClusterInvokerextendsAbstractClusterInvoker{

private final ExecutorService executor = Executors.newCachedThreadPool(new NamedInternalThreadFactory("forking-cluster-timer", true));

publicForkingClusterInvoker(Directory directory){

super(directory);

}

@Override

publicResultdoInvoke(finalInvocation invocation, List> invokers, LoadBalance loadbalance)throwsRpcException{

try {

checkInvokers(invokers, invocation);

final List> selected;

// 获取配置参数

final int forks = getUrl().getParameter(Constants.FORKS_KEY, Constants.DEFAULT_FORKS);

final int timeout = getUrl().getParameter(Constants.TIMEOUT_KEY, Constants.DEFAULT_TIMEOUT);

// 获取并行执行的Invoker列表

if (forks <= 0 || forks >= invokers.size()) {

selected = invokers;

} else {

selected = new ArrayList<>();

for (int i = 0; i < forks; i++) {

// 选择生产者

Invoker invoker = select(loadbalance, invocation, invokers, selected);

// 防止重复增加Invoker

if (!selected.contains(invoker)) {

selected.add(invoker);

}

RpcContext.getContext().setInvokers((List) selected);

final AtomicInteger count = new AtomicInteger();

final BlockingQueue

警惕看不见的重试机制：为什么使用RPC必须考虑幂等性

你可能感兴趣的:(警惕看不见的重试机制：为什么使用RPC必须考虑幂等性)