What is the Two Generals Problem in Distributed Systems?
Understanding How Networks Behave in Distributed Systems
Network reliability is one of the mistaken beliefs about distributed systems that focuses on network behavior. When designing large-scale applications, we often assume some messages will be lost during communication between two nodes. This message loss can happen for several reasons: power failures, cable cuts, node failures, network congestion, etc. System entities often resend requests for retryable errors or timeouts until the acknowledgments are received to improve a system's reliability and reduce the percentage of lost messages. This approach allows us to get a confirmation that the receiver received the message. In this article, we will understand the Two Generals Problem, an analogy to communication between any two nodes through a network.
The Two Generals Problem
The Two Generals Problem is a thought experiment to understand how networks behave in a distributed system. Imagine two generals, each leading an army. Both of them want to capture a city. The only scenario in which the city can be captured is if both armies attack at the same time. If only one of them attacks, they get defeated.
The two are on different sides of the city, and to communicate and coordinate the attack plan, the messengers from each army need to pass through the city, where the city can capture them. Due to this, a message sent by one general may or may not be received by the other general, as the city can capture the messenger carrying the message. In addition to this, the sender of the message will not know if the message was received by the receiver general or not until the receiver general sends an acknowledgment message back to the sender. However, there can be a scenario when the original message was delivered to the receiver, but the acknowledgment was lost. In short, when a receiver doesn't get a message, it is impossible to say whether the sender didn't send a message or the city captured the messenger.
Network Behavior in Distributed Systems
In the distributed systems world, the two generals can be mapped to two nodes, and the city can be mapped to the network the two nodes use to communicate with each other. This thought experiment of the Two Generals Problems points out that no matter how many messages two nodes exchange, they can never be one hundred percent sure about each other's state.
In practical implementations, certain constraints are relaxed to make the systems work, and measures are implemented to address any message loss to some extent. For example, in a system comprising a game e-commerce store and a payment gateway service, to complete a purchase, payment should be successful, and a game download link should be sent. There can be scenarios where the communication between the payment service and the e-commerce service is dropped, and the messages are lost. There can be scenarios such as:
customer is charged, but the game download link is not sent because payment acknowledgment was lost
customer is charged, payment acknowledgment is received, but the game download link is not sent due to some service failure
In the first scenario, the e-commerce service can retry querying the payment service to ensure the payment has been processed. In the second scenario, the customer can be refunded. In both scenarios, we relaxed the constraints (and added upper bounds to number of retries) by allowing the action for a service to be reversed (payment getting refunded) or retried (e-commerce store querying payment service). These measures allow the system to work correctly because it's unlikely that all messages will be lost in a distributed system unless there is a severe network issue or a network partition.
References
Martin Kleppmann. (2020, October 28). Distributed Systems 2.1: The two generals problem [Video]. YouTube. https: //www.youtube.com/watch?v=MDuWnzVnfpI
Good article, but, if I may, I would ask two clarifications.
> Problems points out that no matter how many messages two nodes exchange, they can never be one hundred percent sure about each other's state.
I think that it is a little bit misleading, if both generals are correctly receiving both messages and acknowledges why can't they say that they know the state of each other?
I believe the sentence makes sense only when one of the two isn't receiving a message or an acknowledge.
> certain constraints are relaxed to make the systems work.
I think the word 'constraints' is a little bit misleading. What constraint are you relaxing with a retry system or an automatic refund?
What do you think?
Thank you