Why Distributed Transactions can be slow?
Discussing the cause of Latency for Transactions in Distributed Systems
There are certain times when a particular problem requires orchestrating updates to multiple different services/resources/shards together as an atomic unit of work. If any of the updates fail, the entire work unit needs to roll back because if partially executed, the state of the system as a whole can become inconsistent. This is the scenario where Distributed Transactions can prove to be one of the viable solutions. While in a monolithic or a non-sharded single resource, Transactions might be a great choice, they can cause some serious performance impact when executed within a distributed or a sharded system.
Behind the scenes: 2-Phase Commit
Well, it’s difficult to write about every protocol used by every system available on this planet. So, for the sake of simplicity, this text will focus on one of the most popular protocols used for distributed transactions — 2-Phase Commit protocol (or 2PC).
2PC as the name suggests makes a distributed transaction happen in two steps:
Step 1: Prepare
The orchestrator of the system sends a "prepare” message along with instructions in the given transaction to all the participants of the system. Each participant makes a log of these instructions and responds with an acknowledgement if it is ready to execute them.
Step 2: Commit
Once the orchestrator receives the acknowledgement from all the participants, it sends a “commit” message to all of them, requesting each of the participants to make a change in their state as per the instructions in the given transaction. However, if any of the participants don’t send an acknowledgement for the “prepare” phase, all of the remaining participants are told to roll back.
Besides this, the locks also are held on the system resources to ensure serializability.
Slow? Where?
By now, someone familiar with distributed systems will have a fair idea about the throttling areas that can slow down distributed transactions.
Resource Failure
In 2PC, the success of the transaction depends on whether all participants can acknowledge the prepare phase. If any participant/resource crashes or loses connectivity, the orchestrator waits until timeout which increases the latency and eventually triggers a rollback in which case the transaction needs to be restarted.
Network Latency
As multiple network round trips are required for each phase of 2PC including the retry attempts to establish a connection with unresponsive participants, any increment in network latency can cause a significant delay (or maybe timeout) in the transaction. In addition to this, an increase in the number of participants in the system will also impact the latency as the orchestrator is required to wait for all the participants to acknowledge the prepare phase.
Orchestrator Failure
In case the orchestrator fails before notifying the participants about the commit, participants may keep waiting (until timeout) for the orchestrator to recover. This can block the system from proceeding by halting the release of the locks.