When designing large-scale software applications, it is tempting to base assumptions about the components of a distributed system on personal experience as a user of modern software. These assumptions, however, can turn into significant challenges once the system faces production traffic.
Fallacies of Distributed Systems
1) The Network is Reliable
By nature, a distributed system comprises multiple nodes, systems, or services connected and communicating over a network. Any change or fault in the network can disrupt communication and hinder the system's functionality. These faults can result from factors such as subsea cable cuts, switch failures, or power outages. Additionally, most hardware responsible for network reliability operates with some form of software or firmware. Any subtle bugs or failures in this software can also cause network issues. This leads us to the conclusion that the network is NOT reliable.
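Because any individual call can fail transiently, callers typically wrap network requests in a retry loop. The sketch below is a minimal illustration of retries with exponential backoff and jitter; `request_fn` is a hypothetical stand-in for a real RPC or HTTP client call.

```python
import random
import time

def call_with_retries(request_fn, max_attempts=4, base_delay=0.1):
    """Retry a network call with exponential backoff and jitter.

    `request_fn` is any callable that raises ConnectionError on a
    network fault (a hypothetical stand-in for a real client call).
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the fault
            # back off exponentially, with jitter to avoid retry storms
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Jitter matters here: if many callers retry on the same schedule after a shared outage, their synchronized retries can themselves overload the recovering service.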
2) Latency is Zero
In a distributed system, data centers or servers may sit in the same region or on opposite sides of the world. While this geographic replication improves the resilience of the service, it also increases the time communication and data take to travel between machines. The inherent unreliability of networks adds packet delays on top, further extending the time needed to serve a client request. Modern networks are fast, but intercontinental round trips still take tens to hundreds of milliseconds, so the notion that latency is zero is incorrect.
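One practical consequence is that latency must be budgeted, not ignored: a request that fans out into several dependent calls should pass its shrinking deadline along rather than let each call wait indefinitely. Below is a minimal sketch of this idea; the callables in `calls` are hypothetical stand-ins for RPCs that accept a `timeout` argument.

```python
import time

def call_all_within_budget(calls, budget_s):
    """Run dependent network calls sequentially, tracking the remaining
    latency budget and raising TimeoutError once it is spent.

    Each element of `calls` is a hypothetical callable standing in for
    an RPC that honors a `timeout` keyword.
    """
    deadline = time.monotonic() + budget_s
    results = []
    for call in calls:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("latency budget exhausted")
        # propagate the shrinking budget to the downstream call
        results.append(call(timeout=remaining))
    return results
```

Using a monotonic clock here avoids deadline miscalculations when the wall clock is adjusted mid-request.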
3) Bandwidth is Infinite
Bandwidth measures the amount of data that can be transmitted over a network per unit of time. While modern large-scale applications may create the impression of unlimited bandwidth, there are instances where data demands, like those from online streaming or gaming, can exceed the available capacity, leading to throttling. Moreover, when sharing infrastructure resources such as message brokers among multiple services, bandwidth per producer is constrained to prevent high-throughput producers from impacting other services.
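Such per-producer constraints are often enforced with a token bucket. The following is an illustrative sketch of that technique, not any broker's actual quota mechanism (real brokers enforce quotas server-side):

```python
import time

class TokenBucket:
    """Token-bucket limiter capping a producer's bytes per second so one
    high-throughput producer cannot starve shared bandwidth."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def try_send(self, nbytes):
        now = time.monotonic()
        # refill proportionally to elapsed time, up to the burst cap
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over quota: caller should wait or drop
```

The `burst_bytes` cap lets producers absorb short spikes while the long-run rate stays bounded by `rate_bytes_per_s`.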
4) The Network is Secure
The first lesson when designing web forms that take user input is: never trust user input. Similarly, the standard advice about open Wi-Fi networks is: never connect to open Wi-Fi networks. Likewise, when designing communication among different services over a network, it is crucial to authenticate the source of every request. The internet is full of unfriendly users with malicious intent: some sniff network traffic to decode communication, others attack firewalls to reach private networks, services, or databases. This demonstrates that the network is NOT secure by nature.
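A common way to authenticate a request's source between two services is a shared-secret HMAC signature, sketched below with Python's standard `hmac` module (the secret and payload names are illustrative):

```python
import hashlib
import hmac

def sign_request(secret: bytes, body: bytes) -> str:
    """Compute an HMAC-SHA256 signature over the request body, proving
    the request came from a holder of the shared secret."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_request(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time to avoid
    leaking information through timing side channels."""
    return hmac.compare_digest(sign_request(secret, body), signature)
```

Note that signing alone proves authenticity, not confidentiality; traffic should still travel over an encrypted channel such as TLS so sniffers cannot read the payload.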
5) Topology doesn’t Change
Network topology refers to the arrangement of different nodes in a network. Although it might not be evident to the average internet user, this arrangement changes regularly due to events such as node or data center failures, network partitions, high traffic loads on a single resource, and the addition of new services, nodes, or data centers. These changes result in the re-routing of client requests, which can increase or decrease latency. Such frequent changes in topology can lead to scenarios where the caller times out, fails while waiting for a response, or is unable to connect to the correct callee.
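One defense against shifting topology is to re-resolve a service's address on every call instead of pinning a node forever. The sketch below assumes a hypothetical in-memory service registry with health flags; real systems would consult DNS or a service-discovery system instead.

```python
import random

def resolve(service_name, registry):
    """Pick a live endpoint for `service_name` from a (hypothetical)
    registry mapping names to endpoint records.

    Re-resolving on every call lets callers follow topology changes
    rather than keep dialing a node that has failed or moved.
    """
    endpoints = [e for e in registry.get(service_name, []) if e["healthy"]]
    if not endpoints:
        raise ConnectionError(f"no healthy endpoints for {service_name}")
    # random choice spreads load across the currently healthy nodes
    return random.choice(endpoints)["addr"]
```

Combined with timeouts and retries, this lets a caller route around a node that disappears between one request and the next.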
6) There is One Administrator
In modern computing, with easy access to cloud computing tools and the increasing complexity of online software applications, it is practically infeasible for a single person to manage everything from design to development to deployment. Most internet applications consist of microservices, each owned by different teams with their own development and deployment tools and cycles. Consequently, it is impractical to have one administrator responsible for everything.
7) Transport Cost is Zero
To enable communication between different entities in a distributed system, data must be transported over the network. Similar to moving physical goods in the real world, transporting data online also requires resources such as physical hardware, network bandwidth, software, and power. Although these non-zero transport costs may seem minor initially, they ultimately comprise a significant portion of the overall expenses of operating an online large-scale application.
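A back-of-envelope calculation makes the point concrete. The sketch below estimates a monthly data-transfer bill; the per-gigabyte rate is a placeholder assumption, not any provider's actual price.

```python
def monthly_transfer_cost(requests_per_s, bytes_per_request, usd_per_gb):
    """Rough monthly egress cost for a service.

    `usd_per_gb` is an assumed placeholder rate; real pricing varies by
    provider, region, and tier.
    """
    seconds_per_month = 30 * 24 * 3600
    gb_transferred = requests_per_s * bytes_per_request * seconds_per_month / 1e9
    return gb_transferred * usd_per_gb
```

At a steady 1,000 requests per second with 10 KB responses, even a modest assumed rate of $0.05/GB works out to roughly $1,300 per month for transfer alone, which is why transport cost cannot be treated as zero.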
8) The Network is Homogeneous
Different services or nodes deployed on the internet often have varying configurations in hardware specifications or operational parameters. For example, every device connected to the internet has distinct hardware specifications. Designing a system on the assumption that all devices share a homogeneous configuration leads to performance or compatibility issues. The same principle applies to large-scale distributed systems deployed across multiple geographically distributed data centers, where the physical machines themselves have heterogeneous configurations.
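In practice this means negotiating capabilities instead of assuming them. A small sketch, modeled loosely on HTTP's `Accept-Encoding` header, picks a response compression the client actually declared support for (the preference order here is an illustrative assumption):

```python
def choose_encoding(accept_encoding: str) -> str:
    """Pick a response compression from the client's declared support,
    rather than assuming every client handles the same codecs."""
    supported = [e.strip().split(";")[0] for e in accept_encoding.split(",")]
    for preferred in ("br", "gzip"):  # server's assumed preference order
        if preferred in supported:
            return preferred
    return "identity"  # fall back to sending uncompressed data
```

The same pattern generalizes beyond compression: protocol versions, payload formats, and feature flags should all be negotiated per peer rather than hard-coded for one machine profile.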