Replication means keeping a copy of the same data on multiple machines that are connected via a network. There are several reasons why we might want to replicate data: to keep data geographically close to users (and thus reduce latency), to allow the system to continue working even if some of its parts have failed (and thus increase availability), and to scale out the number of machines that can serve read queries (and thus increase read throughput).
Each node that stores a copy of the database is called a replica. Every write to the database needs to be processed by every replica. The most common solution for this is called leader-based replication (also known as active/passive or master-slave replication).
An important detail of a replicated system is whether the replication happens synchronously or asynchronously.
In this example, the replication to follower 1 is synchronous: the leader waits until follower 1 has confirmed that it received the write before reporting success to the user. The replication to follower 2 is asynchronous.
As we can see, there can be a substantial delay before follower 2 processes the message. There are several possible reasons: the follower may be recovering from a failure, the system may be operating near maximum capacity, or there may be network problems between the nodes.
synchronous | asynchronous | semi-synchronous |
---|---|---|
The follower is guaranteed to have an up-to-date copy of the data. However, the leader must block all writes and wait until the follower responds, even if the follower has crashed. | The most common mode. If the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost. However, the leader can continue processing writes even if all of its followers have fallen behind. | One of the followers is synchronous and the others are asynchronous. If the synchronous follower crashes, one of the asynchronous followers is made synchronous. |
We should ensure the newly added follower has an accurate copy of the leader’s data.
Because the data is always in flux, simply copying data files from one node to the new node would not produce a consistent copy. Locking the database to make the files consistent would go against our goal of high availability.
Instead: take a consistent snapshot of the leader's database, one that is associated with an exact position in the leader's replication log. Copy the snapshot to the new follower node. The follower then connects to the leader and requests all the data changes that have happened since the snapshot. When the follower has processed this backlog of data changes, we say it has caught up.
We should keep the system as a whole running despite individual node failures and keep the impact of a node outage as small as possible.
Each follower keeps a log of the data changes it has received on its local disk, so it knows the last transaction that was processed before the fault occurred. The follower can therefore reconnect to the leader and request all the data changes that occurred while it was disconnected.
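A minimal sketch of this catch-up process follows. The `changes_since` call is a hypothetical stand-in for requesting the replication log from a given position onward; a real system streams the log and persists the applied position durably.

```python
# Sketch of follower catch-up after a crash or disconnection.

class Leader:
    def __init__(self):
        self.log = []  # list of (position, key, value)

    def write(self, key, value):
        self.log.append((len(self.log) + 1, key, value))

    def changes_since(self, position):
        # Hypothetical: return every log entry after `position`.
        return [entry for entry in self.log if entry[0] > position]

class Follower:
    def __init__(self):
        self.data = {}
        self.last_applied = 0  # persisted to local disk in a real system

    def catch_up(self, leader):
        # Request all data changes that occurred while we were disconnected.
        for position, key, value in leader.changes_since(self.last_applied):
            self.data[key] = value
            self.last_applied = position
```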
When the leader fails, one of the followers needs to be promoted to be the new leader. This process is called failover. An automatic failover process usually consists of the following steps: determining that the leader has failed (most systems simply use a timeout), choosing a new leader (through an election, or appointed by a previously elected controller node; the best candidate is usually the replica with the most up-to-date data changes), and reconfiguring the system to use the new leader (clients send their writes to the new leader, and if the old leader comes back, it must become a follower).
Failover is fraught with things that can go wrong: with asynchronous replication, the new leader may not have received all the writes from the old leader, and those writes may simply be discarded; two nodes may both believe they are the leader (a situation known as split brain); and the right failover timeout is hard to choose, since a longer timeout means a longer time to recovery while a shorter one causes unnecessary failovers.
In the simplest case, the leader logs every write statement (INSERT, UPDATE, DELETE) that it executes and sends that statement log to its followers. This is called statement-based replication, and it has many disadvantages: statements that call nondeterministic functions (such as NOW() or RAND()) produce a different value on each replica, statements that depend on existing data must be executed in exactly the same order on every replica, and statements with side effects (triggers, stored procedures) may behave differently on each replica.
Although the leader can replace any nondeterministic function calls with a fixed return value when the statement is logged, there are so many edge cases that other replication methods are generally preferred.
Another method, write-ahead log (WAL) shipping, reuses the append-only log we discussed in the context of storage engines: the leader ships its WAL to followers. The main disadvantage is that the log describes the data at a very low level (which bytes were changed in which disk blocks), so replication is tightly coupled to the storage engine.
An alternative is to use different log formats for replication and for the storage engine; this kind of replication log is called a logical log. A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row: an inserted row contains the new values of all columns, a deleted row contains enough information to uniquely identify the row, and an updated row contains the identifying information plus the new values of the changed columns.
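As an illustration, row-level logical log records might look like the following. This is a sketch using plain dicts; real implementations use a compact binary format.

```python
# Hypothetical row-granularity logical log records.
logical_log = [
    # An inserted row contains the new values of all of its columns.
    {"op": "insert", "table": "users",
     "row": {"id": 42, "name": "Alice", "email": "alice@example.com"}},
    # An updated row contains enough information to identify the row,
    # plus the new values of the changed columns.
    {"op": "update", "table": "users",
     "key": {"id": 42}, "new_values": {"email": "alice@work.example"}},
    # A deleted row contains enough information to identify the row.
    {"op": "delete", "table": "users", "key": {"id": 42}},
]
```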
Trigger-based replication has greater overheads than other replication methods and is more prone to bugs and limitations than the database's built-in replication. However, it offers greater flexibility: for example, you can replicate only a subset of the data, replicate from one kind of database to another, or include custom conflict resolution logic.
In a workload that consists mostly of reads, we have an attractive option: create many followers and distribute read requests across those followers.
If you read from an asynchronous follower, you may see different data from what is on the leader. This inconsistency is just a temporary state: the follower will eventually catch up. This effect is called eventual consistency.
Sometimes, the delay between a write happening on the leader and being reflected on a follower may be just a second, but in other cases, such as during a network problem or when the system is operating near capacity, the lag can easily grow to several seconds.
In this situation, we need read-after-write consistency, also known as read-your-writes consistency: a guarantee that users will always see any updates they submitted themselves. There are various possible techniques to implement read-after-write consistency: when reading something the user may have modified, read it from the leader (for example, always read the user's own profile from the leader); track the time of the last update and, for a short period afterward, serve all of that user's reads from the leader; or have the client remember the timestamp of its most recent write and only read from replicas that are sufficiently up to date (a minimal sketch of this last technique follows).
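The sketch below assumes each replica exposes a hypothetical `replication_position` attribute indicating how far it has caught up in the leader's log:

```python
# Read-your-writes sketch: the client remembers the log position of its
# most recent write and only reads from sufficiently caught-up replicas.

class Replica:
    def __init__(self, replication_position=0):
        self.data = {}
        self.replication_position = replication_position

    def read(self, key):
        return self.data.get(key)

def read_your_writes(key, last_write_position, followers, leader):
    for follower in followers:
        # Only use a follower that has already applied the user's write.
        if follower.replication_position >= last_write_position:
            return follower.read(key)
    return leader.read(key)  # fall back to the leader
```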
Your application may also support access from multiple devices, such as a web browser and a mobile app. In this case, there are some new issues to consider: metadata such as the timestamp of the user's last update needs to be centralized rather than stored on a single device, and requests from different devices may be routed to different datacenters.
Reading from asynchronous followers may cause the anomaly shown in the figure below, in which a user appears to see things moving backward in time.
Monotonic reads is a guarantee stronger than eventual consistency: after a user has read newer data, they will not subsequently read older data. One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica (for example, the replica can be chosen based on a hash of the user ID).
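A minimal sketch of that routing, assuming a fixed list of replicas (note that if the chosen replica fails, the user's reads must be rerouted, which can break the guarantee):

```python
import hashlib

# Pick a replica deterministically from the user ID, so that all of one
# user's reads go to the same replica.
def replica_for_user(user_id, replicas):
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]
```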
This example violates causality.
Preventing this kind of anomaly requires another type of guarantee: consistent prefix reads. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
This is a particular problem in partitioned (sharded) databases, because different partitions often operate independently and there is no global ordering of writes: when a user reads from the database, they may see some parts in an older state and others in a newer state.
One solution is to make sure that any writes that are causally related to each other are written to the same partition.
In a single-node system, transactions are a way for a database to provide stronger guarantees. In a distributed system, however, transactions are often considered too expensive in terms of performance and availability.
In a single-leader system, if you can't connect to the leader for any reason, you can't write to the database. A natural extension of the leader-based replication model is to allow more than one node to accept writes, known as a multi-leader configuration (also called master-master or active/active replication). In this setup, each leader simultaneously acts as a follower to the other leaders.
The benefits of a multi-leader configuration rarely outweigh the added complexity; however, there are some situations in which this configuration is reasonable.
In a multi-leader configuration, you can have a leader in each datacenter.
 | single-leader configuration | multi-leader configuration |
---|---|---|
Performance | Every write must go over the internet to the datacenter with the leader. This can add significant latency to writes. | Every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters. The interdatacenter network delay is hidden from users. |
Tolerance of datacenter outages | If the datacenter with the leader fails, failover can promote a follower in another datacenter to be leader. | Each datacenter can continue operating independently of the others, and replication catches up when the failed datacenter comes back online. |
Tolerance of network problems | Very sensitive to problems in the interdatacenter link, because writes are made synchronously over it. | A temporary network interruption does not prevent writes being processed. |
Multi-leader configurations are often implemented with external tools, such as Tungsten Replicator for MySQL, BDR for PostgreSQL, and GoldenGate for Oracle.
Multi-leader replication has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved.
Multi-leader replication is often considered dangerous territory that should be avoided if possible.
Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the internet.
For example, you need to be able to see your schedule on any device, regardless of whether your device currently has an internet connection. If you make any changes while you are offline, they need to be synced with a server and your other devices when the device is next online.
In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices.
From an architectural point of view, this setup is essentially the same as multi-leader replication between datacenters: each device is a “datacenter”.
Real-time collaborative editing applications allow several people to edit a document simultaneously.
If you want to guarantee that there will be no editing conflicts, the application must obtain a lock on the document before a user can edit it. However, for faster collaboration, you may want to make the unit of change very small (e.g., a single keystroke) and avoid locking.
The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required.
We could make the conflict detection synchronous (i.e., wait for the write to be replicated to all replicas before telling the user that the write was successful), but by doing so we would lose the main advantage of multi-leader replication: allowing each replica to accept writes independently.
The simplest strategy for dealing with conflicts is to avoid them: if the application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur.
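A minimal sketch of such routing, assuming each record is assigned a "home" datacenter whose leader serializes all writes to it (the assignment table and the hash-based default are hypothetical):

```python
import hashlib

# Route all writes for a record to its home datacenter's leader, so a
# single leader serializes the writes and conflicts cannot occur.
HOME_DATACENTER = {}  # record_id -> datacenter, for explicit overrides

def home_datacenter(record_id, datacenters):
    if record_id in HOME_DATACENTER:
        return HOME_DATACENTER[record_id]
    digest = hashlib.sha256(record_id.encode()).digest()
    return datacenters[int.from_bytes(digest[:8], "big") % len(datacenters)]
```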
However, sometimes you might want to change the designated leader for a record—perhaps because one datacenter has failed and you need to reroute traffic to another datacenter, or perhaps because a user has moved to a different location and is now closer to a different datacenter.
The database must resolve write conflicts in a convergent way: all replicas must arrive at the same final value once all changes have been replicated.
There are various ways of achieving convergent conflict resolution: give each write a unique ID and let the write with the highest ID win (last write wins); give each replica a unique ID and let writes from the higher-numbered replica take precedence; somehow merge the values together; or record the conflict in an explicit data structure and let application code resolve it later.
As the most appropriate way of resolving a conflict may depend on the application, most multi-leader replication tools let you write conflict resolution logic using application code.
That code may be executed on write or on read: on write, the conflict handler is called as soon as the database detects a conflict in the log of replicated changes; on read, all the conflicting versions are stored, and the next read returns them all to the application, which resolves the conflict and writes the result back.
Some kinds of conflict are obvious: two writes concurrently modified the same field in the same record. Other kinds of conflict are subtler: for example, if a meeting room booking system accepts bookings for the same room at overlapping times on two different leaders, there is a conflict even though the two writes touched different records.
A replication topology describes the communication paths along which writes are propagated from one node to another.
In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. To prevent infinite replication loops, each node is given a unique identifier, and in the replication log each write is tagged with the identifiers of all the nodes it has passed through; a node ignores any change that is tagged with its own identifier.
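A sketch of this loop-prevention rule, with writes forwarded recursively through hypothetical neighbor links:

```python
# Each write carries the IDs of all nodes it has passed through;
# a node ignores writes it has already seen, breaking the loop.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbors = []  # nodes this one forwards writes to
        self.data = {}

    def receive(self, key, value, seen=frozenset()):
        if self.node_id in seen:
            return  # write already passed through this node: stop
        self.data[key] = value
        for neighbor in self.neighbors:
            neighbor.receive(key, value, seen | {self.node_id})
```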
In circular and star topologies, if just one node fails, it can interrupt the flow of replication messages between other nodes until it is fixed.
In an all-to-all topology, some network links may be faster than others, with the result that some replication messages may "overtake" others, as illustrated in the image below.
This problem is similar to the one we saw in “Consistent Prefix Reads”, but simply attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2.
To order these events correctly, a technique called version vectors can be used.
In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client. However, unlike a leader database, that coordinator does not enforce a particular ordering of writes.
In leaderless replication, failover doesn't exist. In the image below, two out of three replicas accept the write, and the client considers the write successful.
When clients read from the database, they read from several nodes in parallel and use version numbers to determine which value is newer.
After an unavailable node comes back online, how does it catch up on the writes that it missed? Two mechanisms are often used: read repair (when a client detects a stale value among the responses to a read, it writes the newer value back to that replica) and anti-entropy (a background process that constantly looks for differences between replicas and copies any missing data).
If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes. You can think of r and w as the minimum number of votes required for the read or write to be valid.
A common choice is to make n an odd number and to set w = r = (n + 1) / 2 (rounded up).
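A minimal sketch of quorum reads and writes with n = 3 and w = r = 2. The in-memory `replicas` list and version-tagged values are assumptions; a real system sends requests to all n nodes in parallel and tolerates failures.

```python
N, W, R = 3, 2, 2
assert W + R > N  # the read quorum must overlap the write quorum

replicas = [{} for _ in range(N)]  # each maps key -> (version, value)

def quorum_write(key, version, value):
    acks = 0
    for replica in replicas:
        replica[key] = (version, value)  # may fail on some nodes in reality
        acks += 1
    return acks >= W  # successful only if at least w replicas confirmed

def quorum_read(key):
    # Read from r replicas; the value with the highest version wins.
    responses = [r[key] for r in replicas[:R] if key in r]
    return max(responses)[1] if responses else None

quorum_write("greeting", version=1, value="hello")
print(quorum_read("greeting"))  # 'hello'
```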
Even with w + r > n, there are likely to be edge cases where stale values are returned.
Even if our application can tolerate stale reads, it is good to monitor how stale the values are. In leaderless replication, there is no fixed order in which writes are applied, which makes monitoring more difficult.
Databases with appropriately configured quorums can tolerate the failure of individual nodes without the need for failover, which makes leaderless replication appealing for use cases that require high availability and low latency.
If some nodes are disconnected and fewer than w or r nodes are reachable, we face a trade-off: is it better to return errors for all requests for which we cannot reach a quorum, or to accept writes anyway and record them on reachable nodes that are not among the usual n (a sloppy quorum), handing the writes back once the home nodes recover (hinted handoff)?
In order to become eventually consistent, the replicas should converge toward the same value. If you want to avoid losing data, you should know a lot about the internals of your database’s conflict handling.
Even though the writes don't have a natural ordering, we can force an arbitrary order on them: for example, attach a timestamp to each write, declare the write with the biggest timestamp the most "recent", and discard the others. This conflict resolution algorithm is called last write wins (LWW).
If there are several concurrent writes to the same key, LWW sacrifices durability: writes that were reported successful to the client may be silently discarded. If losing data is unacceptable, LWW is a poor choice for conflict resolution.
The only safe way to use a database with LWW is to ensure that a key is only written once and thereafter treated as immutable (for example, by using a unique ID such as a UUID as the key).
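A minimal LWW sketch; the silently dropped writes in `write` are exactly the durability cost described above:

```python
import time

# LWW store: keep only the value with the biggest timestamp per key.
class LWWStore:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        current = self.data.get(key)
        # Concurrent writes with smaller timestamps are silently discarded.
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None
```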
Whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. We need a way to decide whether two operations are concurrent or whether one is causally dependent on the other.
Operations 1 and 2 are concurrent, so the server stores both values separately and gives each a distinct version number. When a client wants to update the value, it must include the version number it most recently received.
In this example, the clients are never fully up to date with the data on the server, since there is another operation going on concurrently. But old versions of the value do get overwritten eventually, and no data is lost.
The dataflow is illustrated graphically in the image below. The arrows indicate which operation happened before the other operation.
The algorithm works as follows: the server maintains a version number for every key and increments it on every write, storing the new value along with the new version number. When a client reads a key, the server returns all values that have not been overwritten, together with the latest version number. A client must read a key before writing it; when it then writes, it must include the version number from the prior read and merge together all values it received. When the server receives a write with a particular version number, it can overwrite all values with that version number or below (since it knows they have been merged into the new value), but it must keep all values with a higher version number (because those writes are concurrent with the incoming write). A minimal sketch follows.
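This sketch implements the single-replica algorithm described above, with one version counter per key and all not-yet-overwritten values (siblings) kept alongside it:

```python
class Server:
    def __init__(self):
        self.latest_version = {}  # key -> latest version number
        self.siblings = {}        # key -> {version: value}

    def read(self, key):
        # Return all current siblings plus the latest version number,
        # which the client must send back with its next write.
        return self.latest_version.get(key, 0), dict(self.siblings.get(key, {}))

    def write(self, key, value, prior_version):
        # prior_version comes from the client's preceding read (0 if none).
        version = self.latest_version.get(key, 0) + 1
        self.latest_version[key] = version
        current = self.siblings.setdefault(key, {})
        # Values the client had already seen are merged into `value`...
        for v in [v for v in current if v <= prior_version]:
            del current[v]
        current[version] = value  # ...while concurrent siblings survive.
        return version
```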
This algorithm ensures that no data is silently dropped, but it requires that the clients do some extra work. If several operations happen concurrently, we call these concurrent values siblings.
Merging sibling values is essentially the same problem as conflict resolution in multi-leader replication. In the previous example, the two final siblings are [milk, flour, eggs, bacon] and [eggs, milk, ham]; note that milk and eggs appear in both.
If clients can not only add things but also remove things from their carts, taking the union of siblings may not yield the right result: if you merge two sibling carts and an item has been removed in only one of them, the removed item will reappear in the union. To prevent this, the system cannot simply delete the item from the database; it must leave a marker with an appropriate version number to indicate that the item has been removed. This deletion marker is called a tombstone.
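A sketch of the merge with tombstones. Here each sibling maps an item to "present" or "tombstone"; real implementations attach version numbers to tombstones so a deletion can be ordered against a concurrent add.

```python
def merge_siblings(siblings):
    merged = {}
    for cart in siblings:
        for item, state in cart.items():
            if merged.get(item) != "tombstone":  # a tombstone wins here
                merged[item] = state
    return {item for item, state in merged.items() if state == "present"}

cart_a = {"milk": "present", "flour": "present", "ham": "tombstone"}
cart_b = {"milk": "present", "ham": "present"}
print(merge_siblings([cart_a, cart_b]))  # {'milk', 'flour'}: ham stays deleted
```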
In the previous example, we used only a single replica. In the following, we discuss how the algorithm changes when there are multiple replicas, but no leader.
We need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas.
The collection of version numbers from all the replicas is called a version vector. Version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. The version vector allows the database to distinguish between overwrites and concurrent writes.
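A sketch of comparing two version vectors (maps from replica ID to a counter) to decide between overwriting and keeping concurrent siblings:

```python
def compare(a, b):
    replicas = set(a) | set(b)
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"  # b may safely overwrite a
    if b_le_a:
        return "b happened before a"  # a may safely overwrite b
    return "concurrent"               # keep both values as siblings

print(compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1}))  # a happened before b
print(compare({"r1": 2}, {"r2": 1}))                    # concurrent
```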
Replication can serve several purposes: high availability (keeping the system running even when some machines are down), disconnected operation (allowing an application to continue working during network interruptions), latency (placing data geographically close to users), and scalability (handling a higher volume of reads than a single machine could, by performing reads on replicas).
We discussed three main approaches to replication: single-leader replication (clients send all writes to a single leader, which streams data change events to the followers; reads can be performed on any replica, but reads from followers may be stale), multi-leader replication (clients send writes to one of several leader nodes, which stream data change events to one another and to any follower nodes), and leaderless replication (clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct stale data).
Single-leader replication is fairly easy to understand and there is no conflict resolution to worry about. Multi-leader and leaderless replication can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the cost of being harder to reason about and providing only very weak consistency guarantees.
Replication can update followers in two ways: synchronously or asynchronously. Asynchronous replication can be fast when the system is running smoothly, but if a leader fails and an asynchronously updated follower is promoted, recently committed data may be lost.
We discussed a few consistency models that are helpful for deciding how an application should behave under replication lag: read-after-write consistency (users should always see data that they submitted themselves), monotonic reads (after users have seen the data at one point in time, they shouldn't later see the data from some earlier point in time), and consistent prefix reads (users should see the data in a state that makes causal sense, for example seeing a question and its reply in the correct order).
Finally, we discussed the consistency issues in multi-leader and leaderless replication approaches (because they allow multiple writes to happen concurrently). We examined an algorithm that a database might use to determine whether one operation happened before another, or whether they happened concurrently. We also touched on methods for resolving conflicts by merging together concurrent updates.