Distributed locking is a mechanism for coordinating access to shared resources in a distributed system, such as a microservices architecture or a cluster of databases. It ensures that at any given moment only one node can perform a particular operation on a resource. Such locks are typically used to safeguard data consistency and integrity, because uncoordinated concurrent operations can lead to race conditions, conflicting updates, and inconsistent data.
Several algorithms exist for implementing locks in distributed systems, and Redlock is one of them, built on Redis. It runs five independent Redis nodes and treats a lock as acquired only when a majority of them grant it, with the aim of improving fault tolerance and availability. However, in his blog post “How to do distributed locking”, Martin Kleppmann analyzes the Redlock algorithm in depth and points out its potential problems and shortcomings.
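The majority-voting acquisition described above can be sketched roughly as follows. This is a simplified illustration, not the reference implementation: `FakeRedisNode` is a hypothetical in-memory stand-in for a real Redis node, and the `drift_factor` allowance and function names are assumptions made for the example.

```python
import time
import uuid


class FakeRedisNode:
    """Hypothetical in-memory stand-in for one independent Redis node."""

    def __init__(self, available=True):
        self.available = available
        self._store = {}  # key -> (value, expiry on the monotonic clock)

    def set_nx_px(self, key, value, ttl_ms):
        """Mimic `SET key value NX PX ttl_ms`: succeed only if key is absent or expired."""
        if not self.available:
            raise ConnectionError("node unreachable")
        now = time.monotonic()
        current = self._store.get(key)
        if current is not None and current[1] > now:
            return False
        self._store[key] = (value, now + ttl_ms / 1000.0)
        return True

    def delete_if_equal(self, key, value):
        """Remove the key only if it still holds the caller's random value."""
        current = self._store.get(key)
        if current is not None and current[0] == value:
            del self._store[key]


def release(nodes, key, value):
    """Release the lock on every node where we still own it."""
    for node in nodes:
        node.delete_if_equal(key, value)


def acquire(nodes, key, ttl_ms, drift_factor=0.01):
    """Try to lock a majority of nodes; return (value, validity_ms) or None."""
    value = uuid.uuid4().hex  # random value identifying this lock holder
    start = time.monotonic()
    acquired = 0
    for node in nodes:
        try:
            if node.set_nx_px(key, value, ttl_ms):
                acquired += 1
        except ConnectionError:
            pass  # an unreachable node simply does not count as a vote
    elapsed_ms = (time.monotonic() - start) * 1000.0
    # Remaining validity: TTL minus acquisition time minus an assumed drift allowance.
    validity_ms = ttl_ms - elapsed_ms - (ttl_ms * drift_factor + 2)
    quorum = len(nodes) // 2 + 1
    if acquired >= quorum and validity_ms > 0:
        return value, validity_ms
    release(nodes, key, value)  # failed: undo any partial acquisition
    return None
```

With five nodes the quorum is three, so the lock can still be acquired when two nodes are down. Note that the validity calculation depends on elapsed time and clock drift, which is where the timing assumptions enter.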
First, Kleppmann argues that if a lock is used purely for efficiency, that is, to avoid doing the same work twice, there is no need for a complex algorithm like Redlock. In such cases a single Redis instance may well suffice, because occasionally losing a lock has no serious consequences.
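A minimal sketch of such a single-instance, best-effort lock is shown below. It assumes a redis-py style client object is passed in (for example `redis.Redis(host="localhost", port=6379)`), and the function names `try_lock` and `unlock` are hypothetical. The lock can still be lost on expiry or failover, so it should only guard work that is safe to repeat.

```python
import uuid

# Lua script: delete the key only if it still holds our token, so a client
# whose lock has already expired cannot release a lock now held by another client.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def try_lock(client, key, ttl_ms):
    """Return a random token if the lock was taken, otherwise None.

    Best-effort only: the lock is approximate and may occasionally fail.
    """
    token = uuid.uuid4().hex
    # Equivalent to the Redis command: SET key token NX PX ttl_ms
    if client.set(key, token, nx=True, px=ttl_ms):
        return token
    return None


def unlock(client, key, token):
    """Release the lock if we still own it; return True if it was released."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

The random token is what makes the release safe: without the compare-then-delete step, a slow client could delete a lock that has since been granted to someone else.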
However, when correctness depends on the lock, for instance in operations such as updating shared storage systems, performing calculations, or calling external APIs, it is essential that the lock actually holds. In these scenarios, Kleppmann believes that Redlock is not suitable because it lacks a mechanism for generating fencing tokens, which are necessary to guarantee the safety of the lock. A fencing token is a number that the lock service increments each time a client acquires the lock. The client sends the token with every write, and the storage service rejects any write whose token is lower than one it has already seen, so a client that was delayed by the network or paused past the expiry of its lock cannot overwrite newer data.
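The fencing-token check can be illustrated with a small sketch. The classes `LockService`, `FencedStorage`, and `StaleTokenError` are hypothetical names introduced for this example, and the lock service deliberately omits expiry and ownership tracking to keep the focus on the token check.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hypothetical lock service: issues a strictly increasing token per grant."""

    def __init__(self):
        self._last_token = 0

    def acquire(self):
        # Expiry and ownership are omitted; only token generation is shown.
        self._last_token += 1
        return self._last_token


class FencedStorage:
    """Hypothetical storage service that enforces fencing on every write."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        # Reject writes from a client whose lock has been superseded.
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}"
            )
        self._highest_token = token
        self.data[key] = value


if __name__ == "__main__":
    service = LockService()
    storage = FencedStorage()

    token_a = service.acquire()  # client A gets the lock, then pauses
    token_b = service.acquire()  # A's lock expires; client B gets the lock
    storage.write("file", "written by B", token_b)

    try:
        storage.write("file", "written by A", token_a)  # A wakes up late
    except StaleTokenError as err:
        print("rejected:", err)
```

The key design point is that the check lives in the storage service, not in the client: a paused client cannot know its lock has expired, so only the resource being protected can reliably refuse the stale write.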
Furthermore, Kleppmann discusses Redlock’s reliance on timing assumptions. The algorithm presumes that network delays, process pauses, and clock errors are all bounded and small compared with the lifetime of the lock. In practice these assumptions may not hold: a packet can be delayed in the network for a long time, a process can pause (for example, during garbage collection), and an adjustment to the system clock can make a lock expire earlier or later than intended. Once these assumptions are violated, Redlock may violate its safety property, so that two clients can believe they hold the same lock at the same time.
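To make the timing problem concrete, here is a small simulation with a manually controlled clock. It is a sketch under simplified assumptions (one lock node, hypothetical class and function names), not a model of Redlock itself. Advancing the clock stands in either for a client pause longer than the lock lifetime or for the lock node's clock jumping forward.

```python
class ManualClock:
    """Hypothetical clock that only moves when told to."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class ExpiringLock:
    """A single lock whose expiry is judged by the lock node's clock."""

    def __init__(self, clock):
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, client, ttl):
        still_held = self.holder is not None and self.clock.now < self.expires_at
        if still_held:
            return False
        self.holder = client
        self.expires_at = self.clock.now + ttl
        return True


def simulate(ttl=10.0, pause=15.0):
    """Return the clients that believe they hold the lock after the pause."""
    clock = ManualClock()
    lock = ExpiringLock(clock)
    believers = []
    if lock.try_acquire("A", ttl):
        believers.append("A")  # A never learns that its lock expired
    clock.advance(pause)       # A is paused, or the node's clock jumps
    if lock.try_acquire("B", ttl):
        believers.append("B")
    return believers


if __name__ == "__main__":
    print(simulate(pause=5.0))   # pause shorter than the lock lifetime
    print(simulate(pause=15.0))  # pause longer than the lock lifetime
```

When the pause is shorter than the lock lifetime only client A holds the lock, but once it exceeds the lifetime both A and B believe they hold it, which is exactly the situation a fencing token is meant to catch.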
Finally, Kleppmann suggests that if locks are needed for correctness, a consensus system like ZooKeeper should be chosen, and the use of fencing tokens enforced on every access to the resource protected by the lock. For best-effort locks (i.e., locks used as an efficiency optimization rather than for correctness), he recommends using a single Redis instance and clearly documenting in the code that the locks are only approximate and may occasionally fail.
In conclusion, Kleppmann’s article highlights key considerations in designing distributed locking solutions, including the purpose of the locks, the trade-off between correctness and efficiency, and the dependency of algorithms on timing and system model assumptions. His insights remind us to be cautious in selecting and employing distributed locking solutions to ensure the reliability and stability of our systems.
https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
https://dl.acm.org/doi/pdf/10.1145/2639988.2655736