Why are Event-Driven Systems Hard?
Understanding the Core Challenges of Asynchronous Architectures
An event is just a small message that says, "Hey, something happened!" For example, UserClickedButton, PaymentProcessed, or NewOrderPlaced. Services subscribe to the events they care about and react accordingly. This event-driven approach makes systems resilient and flexible. However, building and managing these systems at a large scale is surprisingly hard.
Managing Message Format Versions
Imagine you and your friend have a secret code to pass notes. One day, you decide to add a new symbol to the code to mean something new. If you start using it without telling your friend, your new notes will confuse them. This is exactly what happens in event-driven systems.
For example, an OrderPlaced event might look something like this (the exact field names here are illustrative):
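```json
{
  "eventType": "OrderPlaced",
  "eventId": "evt-1001",
  "orderId": "12345",
  "customerId": "98765",
  "totalAmount": 49.99
}
```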
Now imagine another service reads this event to send a confirmation email. Then, six months later, you add a new field: shippingAddress. You update the producer. The event becomes:
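```json
{
  "eventType": "OrderPlaced",
  "eventId": "evt-1002",
  "orderId": "12346",
  "customerId": "98765",
  "totalAmount": 49.99,
  "shippingAddress": "221B Baker Street, London"
}
```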
The problem is that other services, like the OrderConfirmationEmailService, might still be expecting the old version 1 format. When they receive this new message, they won't know what to do with the shippingAddress field. Worse, if a field they relied on was removed, they would simply crash.
This forces teams to carefully manage how schemas evolve. Common strategies include:
Backward Compatibility: New schemas can be read by services expecting the old schema. This usually means you can only add new, optional fields. You can't rename or remove existing ones.
Forward Compatibility: Services expecting a new schema can still read messages written in an old one. This is harder to achieve and often requires setting default values for missing fields (the sketch after this list shows one way to do this).
Schema Registry: This is like a central dictionary for all your event "secret codes." Before a service sends a message, it checks with the registry to make sure the format is valid and compatible. It prevents services from sending out "confusing notes."
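As a rough illustration, here is a minimal Python sketch of a consumer that copes with both versions of the OrderPlaced event above. The field names and the send_confirmation_email helper are made up for this example; the point is that the consumer ignores fields it doesn't know about and falls back to a default for fields an older producer may not send.

```python
# Minimal sketch: a consumer that tolerates schema evolution.
# Field names and the email helper are illustrative, not from a real system.

def handle_order_placed(event: dict) -> None:
    # Read only the fields this service actually needs; unknown fields
    # (like a newly added shippingAddress) are simply ignored.
    order_id = event["orderId"]
    customer_id = event["customerId"]

    # Supply a default for fields that older producers may not send,
    # so the same code works with both version 1 and version 2 events.
    shipping_address = event.get("shippingAddress", "unknown")

    send_confirmation_email(order_id, customer_id, shipping_address)

def send_confirmation_email(order_id: str, customer_id: str, address: str) -> None:
    print(f"Emailing customer {customer_id}: order {order_id} ships to {address}")
```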
Without strict rules for changing message formats, a simple update can cause a cascade of failures throughout a large system.
Observability and Debugging
In a traditional, non-event-driven system, when a user clicks a button, one piece of code calls another, which calls another, in a straight line. If something breaks, you can look at the error log and see the entire sequence of calls, like following a single piece of string from start to finish.
In an event-driven system, that single string is cut into dozens of tiny pieces. The OrderService publishes an OrderPlaced event. The PaymentService, ShippingService, and NotificationService all pick it up and do their own work independently. They might, in turn, publish their own events.
Now, imagine a customer calls saying they placed an order but never got a confirmation email. Where did it go wrong?
Did the OrderService fail to publish the event?
Did the NotificationService not receive it?
Did it receive the event but fail to connect to the email server?
Debugging this can be difficult as you can't see the whole picture at once.
To solve this, we use distributed tracing. When the very first event is created, we attach a unique ID to it, called a Correlation ID. Every service that processes this event or creates a new event as a result must copy that same ID onto its own work.
When you need to investigate a problem, you can search for this one correlation ID across all the logs of all your services. This allows you to stitch the story back together and see the journey of that single request across the entire distributed system.
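In code, propagating a Correlation ID is mostly a matter of copying one field and logging it everywhere. A minimal sketch, assuming a hypothetical publish helper and dictionary-shaped events:

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payment-service")

def publish(topic: str, event: dict) -> None:
    # Placeholder for a real message-broker client.
    print(topic, event)

def handle_order_placed(event: dict) -> None:
    # Reuse the correlation ID from the incoming event, or start a new
    # trace if this is the very first event in the flow.
    correlation_id = event.get("correlationId", str(uuid.uuid4()))

    # Every log line carries the correlation ID so the whole journey
    # can be stitched back together later.
    log.info("processing order %s [correlationId=%s]", event["orderId"], correlation_id)

    # Any event published as a result copies the same correlation ID.
    publish("payments", {
        "eventType": "PaymentRequested",
        "orderId": event["orderId"],
        "correlationId": correlation_id,
    })
```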
Handling Failures and Message Loss
Events can disappear, not because of bugs, but because of infrastructure issues: a network failure, a service crashing, or the message broker itself having a problem.
The core promise of many event systems is at-least-once delivery. This means the system will do everything it can to make sure your event gets delivered. If a service that is supposed to receive an event is temporarily down, the message broker will hold onto the message and try again later.
But what if a service has a persistent bug and crashes every time it tries to process a specific message? The broker will keep trying to redeliver it, and the service will keep crashing, until the broker's retry limit is reached. To handle this, we use a Dead-Letter Queue (DLQ). After a few failed delivery attempts, the message broker moves the crash-causing message to the DLQ. This stops the cycle of crashing and allows the service to continue processing other, valid messages. Engineers can then inspect the DLQ later to debug the problematic message.
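Brokers such as RabbitMQ and Amazon SQS offer dead-lettering as configuration rather than code, but the logic they implement looks roughly like this sketch, where the topic names, retry limit, and process handler are made up for illustration:

```python
MAX_DELIVERY_ATTEMPTS = 3  # assumption: a small, fixed retry budget

def publish(topic: str, event: dict) -> None:
    # Placeholder for a real message-broker client.
    print(f"-> {topic}: {event}")

def process(event: dict) -> None:
    # Simulates a consumer with a persistent bug on this message.
    raise RuntimeError("simulated poison message")

def deliver(event: dict) -> None:
    attempts = event.get("deliveryAttempts", 0)
    try:
        process(event)  # the consumer's normal handler
    except Exception:
        if attempts + 1 >= MAX_DELIVERY_ATTEMPTS:
            # Give up and park the message for humans to inspect later,
            # so the consumer can keep working through healthy messages.
            publish("orders.dead-letter", event)
        else:
            event["deliveryAttempts"] = attempts + 1
            publish("orders.retry", event)  # redeliver later
```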
Idempotency
The guarantee of "at-least-once delivery" creates a new, tricky problem: what if a message is delivered more than once? This can happen if a service processes an event but crashes before it can tell the message broker, "I'm done!" The broker, thinking the message was never handled, will deliver it again when the service restarts.
If the event was IncreaseItemCountInCart, receiving it twice is a big problem. The customer who wanted one item now has two in their cart. If it was ChargeCreditCard, they get charged twice.
To prevent this, services must be idempotent: processing the same event multiple times must have the same effect as processing it once. We can achieve this by having the service keep a record of the event IDs it has already processed. When a new event comes in, the service first checks its records.
Has it seen this event ID before?
If yes, it simply ignores the duplicate and tells the broker, "Yep, I'm done."
If no, it processes the event and then saves the event ID to its records before telling the broker it's done.
This ensures that even if a message is delivered 100 times, the action is only performed once.
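A minimal sketch of an idempotent consumer, using an in-memory set as the record of processed event IDs (a real service would persist this, ideally in the same transaction as the work itself); the event shape and the charge_credit_card helper are illustrative:

```python
processed_event_ids: set[str] = set()  # in-memory stand-in for a durable store

def charge_credit_card(customer_id: str, amount: float) -> None:
    print(f"Charging {customer_id} ${amount:.2f}")

def handle_event(event: dict) -> None:
    event_id = event["eventId"]

    # Seen this event before? Ignore the duplicate and acknowledge again.
    if event_id in processed_event_ids:
        return

    # First time: do the work, then remember the ID before acknowledging.
    charge_credit_card(event["customerId"], event["amount"])
    processed_event_ids.add(event_id)

# Delivering the same event twice results in exactly one charge.
evt = {"eventId": "evt-42", "customerId": "98765", "amount": 49.99}
handle_event(evt)
handle_event(evt)
```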
Eventual Consistency
In a simple application with one database, when you write data, it's there instantly. If you change your shipping address, the very next screen you load will show the new address. This is called strong consistency.
Event-driven systems give up this guarantee for the sake of scalability and resilience. They operate on a model of eventual consistency. For example, when a user updates their address, the CustomerService updates its own database and publishes an AddressUpdated event. The ShippingService and BillingService subscribe to this event, but it might take a few hundred milliseconds for them to receive it and update their own data. (This example is just for context; ideally, the address would be stored in one place and only the ID of that record would be passed around in events.)
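To make the lag concrete, here is a toy sketch in which a plain list stands in for the message broker; the services and their data stores are hypothetical:

```python
event_bus: list[dict] = []  # toy stand-in for a real message broker

class CustomerService:
    def __init__(self) -> None:
        self.addresses: dict[str, str] = {}

    def update_address(self, customer_id: str, address: str) -> None:
        self.addresses[customer_id] = address              # update own database first
        event_bus.append({"eventType": "AddressUpdated",   # then publish the event
                          "customerId": customer_id,
                          "address": address})

class ShippingService:
    def __init__(self) -> None:
        self.addresses: dict[str, str] = {}                # its own, separately owned copy

    def poll(self) -> None:
        # In a real system this happens asynchronously, some time later.
        while event_bus:
            event = event_bus.pop(0)
            if event["eventType"] == "AddressUpdated":
                self.addresses[event["customerId"]] = event["address"]

customers, shipping = CustomerService(), ShippingService()
customers.update_address("98765", "221B Baker Street")
print(shipping.addresses)  # {}  <- the two services briefly disagree
shipping.poll()
print(shipping.addresses)  # {'98765': '221B Baker Street'}  <- eventually consistent
```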
Designing for eventual consistency means the system must be built to handle this temporary state of disagreement. This might involve:
Designing user interfaces that account for the delay.
Adding logic to services to double-check critical data if needed.
Accepting that for some non-critical data, a small delay is acceptable.
If you enjoyed this article, please hit the ❤️ like button.
If you think someone else will benefit from this, please 🔁 share this post.