4 Comments
Tim Marcus Moore

This is a great overview. Thanks for sharing your wisdom in such an easily-understood article.

I'd add one major category to this list: sequencing. The article talks about “services” placing items on a queue and being read by other “services”, but services generally consist of a set of nodes acting somewhat independently. If service B is consuming messages M1, M2, and M3 from a queue, and service B consists of three server nodes B1, B2, and B3, it’s possible that all three messages will be processed concurrently, with no guarantee that the effects of processing M1 will occur before M3 is processed on another node.

You can solve this by having only a single consumer, but that hits scaling limits. The usual solution is to partition the queue into sub-queues, each with its own ordering guarantee, and ensure that each partition is only delivered to a single consumer.
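The partitioning scheme described above can be sketched in a few lines. This is a hypothetical illustration (the function name and partition count are mine, not from any particular broker): messages sharing a key, such as an order ID, always map to the same partition, and as long as each partition has a single consumer, per-key ordering is preserved while different keys are still processed in parallel.

```python
import hashlib

NUM_PARTITIONS = 8  # chosen up front; as noted below, hard to change later

def partition_for(key: str) -> int:
    """Deterministically map a message key to a partition.

    A stable hash is used (Python's built-in hash() is salted per
    process) so producers on different nodes agree on the mapping.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```

Because the mapping depends on `NUM_PARTITIONS`, changing the partition count reshuffles which keys land where, which is exactly why resizing a live system without breaking ordering is hard.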

This is a good solution, but it comes with its own challenges. For one thing, how many partitions do you use? There's a balance: too few limits your scaling; too many wastes resources on overhead. It can also be hard to change later: you can't really change the number of partitions while the system is running and still maintain your ordering guarantees, but most systems at the scale where this matters can't stop running. The best solution I've found is essentially to create a new set of topics and consumers, then transition the system from the old ones to the new ones. This is tricky in practice, quite costly to get right, and even more costly to get wrong. I have to admit, though: it's been a few years since I've looked at this problem. Maybe newer message broker systems have better solutions.

This also relates to the issue of message loss and the dead-letter queue. If all of the messages are independent, a dead-letter queue can be a good approach, but what if the process for M3 assumes that you’ve already processed M1 and M2? For example, let's say M1 creates an order, M2 updates it, and M3 dispatches it. Best case scenario: you’ll also crash on M3 and deliver that to the dead-letter queue (such as if M1 wasn't processed, and there's no order to update or dispatch), but if you aren’t careful, it might just succeed but do the wrong thing (such as if M1 and M3 are processed, but not M2).
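The create/update/dispatch failure mode above can be made concrete with a small sketch (in-memory dicts stand in for the order store and the dead-letter queue; all names are illustrative). The point is that checking preconditions explicitly turns the dangerous "succeeds but does the wrong thing" case into a detectable failure that can be routed to the dead-letter queue.

```python
orders = {}        # stand-in for the order store
dead_letters = []  # stand-in for the dead-letter queue

def handle(msg: dict) -> None:
    order_id = msg["order_id"]
    if msg["type"] == "create":          # M1
        orders[order_id] = {"status": "created", "updated": False}
    elif msg["type"] == "update":        # M2
        if order_id not in orders:
            dead_letters.append(msg)     # M1 missing: nothing to update
            return
        orders[order_id]["updated"] = True
    elif msg["type"] == "dispatch":      # M3
        order = orders.get(order_id)
        # Without this precondition check, dispatching an order that was
        # never updated would quietly "do the wrong thing".
        if order is None or not order["updated"]:
            dead_letters.append(msg)
            return
        order["status"] = "dispatched"
```
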

Then, there’s also the problem of ensuring that the messages are enqueued in the right order, which might also not be straightforward, depending on the design of the producing system.

I like event-driven architectures, and think they can have a lot of benefits that outweigh the cost of these challenges (in the right use cases). I just think it's important that people go into these system design decisions with full awareness of the tradeoffs they're signing up to.

Sid

Thanks a lot for your kind words, Tim. Glad you liked the article.

Excellent addition, and I totally agree with the concerns around sequencing. Thanks a lot for providing details with examples.

V_V

To tag along to this, there's also the problem of multi-message dependencies, similar to ordering but subtly different.

Let's say Service A emits a message M1 and Service B emits a message M2, and Service C consumes both, requiring both M1 and M2 to action a process. One of the event chains therefore has to terminate and wait for the second event, but there's no guarantee of which event will arrive first at Service C.

One way around this is to hold state in a DB and save an item unique to each of the messages M1 and M2, say K1 and K2, then execute the common logic at the end of each event's processing in Service C only if both K1 and K2 are available.

This creates a new problem in turn: what if they both arrive at the same time? To handle this, ensure the common logic only executes after a database commit, that a database read happens at the start of the common logic, and that any database writes touch only the column in which K1 or K2 is saved, so that one save cannot override the other.

This scales differently depending on the DB locking mechanism used. We ended up minimising the size of the common code to reduce the likelihood of this occurrence and the duration of the lock. (I'm all ears if someone has a non-DB-locking solution.)

This still leaves the problem of saving potentially useless data in K1 and K2. The solution is just to ensure K1 and K2 represent something meaningful worth saving (or, as a hack, populate a yet-to-be-used DB column).
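A minimal sketch of this K1/K2 join, using SQLite as a stand-in for the shared DB (table and column names are mine, purely illustrative). Each handler writes only its own column inside a committed transaction, then re-reads the row; whichever handler observes both columns populated runs the common logic. The simultaneous-arrival race described above is where the real DB's locking mechanism comes in, which this single-connection sketch doesn't exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE joins (id TEXT PRIMARY KEY, k1 TEXT, k2 TEXT)")
conn.execute("INSERT INTO joins (id) VALUES ('job-1')")
conn.commit()

def on_message(join_id: str, column: str, value: str) -> bool:
    """Record this message's key, then check whether both have arrived.

    Returns True if this call observed both K1 and K2, i.e. it should
    run the common logic. Writing only our own column means concurrent
    handlers cannot overwrite each other's save.
    """
    assert column in ("k1", "k2")  # fixed column names; keeps SQL safe
    with conn:  # transaction: the write commits before the read below
        conn.execute(f"UPDATE joins SET {column} = ? WHERE id = ?",
                     (value, join_id))
    row = conn.execute("SELECT k1, k2 FROM joins WHERE id = ?",
                       (join_id,)).fetchone()
    return row[0] is not None and row[1] is not None
```
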

Ian Pak

On the idempotency case: how can we make sure the insert is actually done properly? Even if there's a flag marking the event as processed, how does that guarantee the processing itself completed correctly?
