How is Data Replicated in Distributed Systems?
Understanding Single Leader, Multi-Leader, and Leaderless Replication
As large-scale software applications grow, they face fundamental challenges in increasing both availability (ensuring that if one node is offline, another can take over) and scalability (enhancing the throughput of the service). Setting up multiple nodes with identical copies of data, known as Data Replication, might seem like an obvious solution to these challenges. However, it requires the nodes to coordinate using specific communication strategies to maintain data consistency across all participating hosts. This essay will cover the three main approaches distributed systems use to replicate data across nodes.
Single Leader Replication
As the name suggests, the nodes in the system elect one node as the leader (single leader), and the other nodes follow its instructions. All updates from clients are exclusively handled by the leader node, which then forwards these updates to the follower nodes. Follower nodes update their local state and logs based on the updates received from the leader. If a node goes offline and later rejoins the network, it synchronizes its local state with the leader. However, if the leader goes down, the follower nodes either wait for the leader to come back online or elect a new leader which may bring the system to a halt till the time a leader is online again. Communication from the leader to the followers can be asynchronous (the leader returns a response to the client before sending updates to the followers), synchronous (the leader waits for all followers to receive updates before returning a response to the client), or hybrid (the leader waits for a majority of followers to receive updates before returning a response to the client).
Multi Leader Replication
You guessed it right! In this strategy, the system has multiple leader nodes, each with its follower nodes. All leader nodes are capable of processing updates from clients. To distribute updates to all nodes in the system, the leader nodes synchronize with each other asynchronously, and then each leader passes the updates to its followers. This approach allows for high throughput since clients can write to multiple leader nodes and better failover as in case a leader goes offline, the clients can connect to another leader. However, clients writing to multiple leaders can lead to scenarios where multiple leaders try to update the same data source simultaneously, resulting in write conflicts. This approach requires effective conflict resolution strategies to handle such write conflicts.
Leaderless Replication
As the name suggests, this system lacks an exclusive leader. Instead, the client sends read and write requests to all nodes, ensuring high availability. In this approach, the client sets a minimum number of nodes (quorum) that must acknowledge reads and writes for the operation to be considered valid. For example, if there are five nodes in the system and the client has a quorum value of three, the client will send requests to all five nodes, and as long as three nodes acknowledge the requests, the operation is deemed successful. This mitigates data inconsistency issues if up to two out of five nodes (in this scenario) are slow to update or offline. In Leaderless replication, each node shares the updates it received with all or a subset of other nodes, eventually making the system consistent. This method of nodes sharing updates is often referred to as gossiping (gossip protocols).