The resurgence of event driven architecture

Kim Clark
13 min read · May 28, 2021


Working in the integration space for the last few decades, the idea of event driven anything seems like old news. Messaging, after all, has been around for as long as I can remember. It would be easy, therefore, to dismiss the resurgence in interest in event driven architecture (EDA) over the last few years as just a re-discovery of old techniques. However, that conclusion would misunderstand the changes in technology, the specific patterns that are rising to the surface, and indeed the underlying reasons for the trend in the first place.

This piece feels long overdue as I’ve been presenting around this topic and discussing it with customers for many years. However, the historical context and its relationship to a range of technologies is critical for a number of other articles I plan to write going forward.

So why the renewed interest in asynchronous interaction, what are the technologies in play, and how is their usage different?

Defining EDA

First things first, what do we mean by event driven architecture (EDA)? Popular use of the term itself is at the very least more than a decade old, and the idea of “event driven” in general goes back as far as computing itself. In the broadest sense it just means capitalizing on the benefits of truly asynchronous interactions in order to ensure a complete hand-off between one component and another. This frees up the upstream resources (threads for example) to do other work and enables simple workload distribution across downstream components.

We can add increasing levels of sophistication to that by adding concepts such as queues, publish/subscribe, assured once only delivery, durable subscriptions, dead letter queues and many more. Messaging patterns such as these were described long ago in books such as Enterprise Integration Patterns (2003) and before that too, and have been implemented by messaging technology such as IBM MQ and used by applications and integration solutions throughout that time.

A slightly more recent technology is of course Apache Kafka, an event streaming capability which was open sourced by LinkedIn in 2011 (so it’s clearly not that new!) and initially gained traction due to its ability to satisfy exceedingly large numbers of subscribers at high message throughput. However, its underlying implementation retains an event history, and this in itself became useful for a subset of application patterns.

Many other technologies for asynchronous communication exist, each with their own characteristics. In this short article, we’re going to focus on the relationship between messaging based technologies and those based on Kafka, as these are two of the most prevalent options in play at the current time.

In line with the terminology typically used in the industry, we will mostly use the term “message” when talking about messaging technology, and “event” when talking about Kafka. However, we need to be clear that use of the term “event” in “event driven architecture” existed well before Kafka, and conceptually messaging based technology is just as much “event driven” as Kafka based technology.

So, event driven architecture refers generically to writing applications in a way that benefits from asynchronous communication. This can and has been applied to everything from the back end payments systems underpinning much of the world’s core banking systems, through to the capture of the minutiae of front end user interactions.

Fundamental differences and similarities

We’re going to cover lots of deeper subtleties, but let’s start with the core functional differences between messaging and Kafka.

Messaging systems typically delete messages as they are read. Removing messages as part of a read means an application does not need to worry about reprocessing a message accidentally. This reduces the complexity of the application, but means that the messaging server must do “granular” processing of individual messages with immediate acknowledgement. So, messaging provides a natural fit where accurate delivery of individual messages is required, but it means a consumer can’t go back and re-read past messages.

In Kafka, events are not deleted as they are read. It retains an “event log”: a history of recent events that have happened. Events are instead deleted through administrative choices around archiving rather than based on which applications have actually seen them. The event history itself turns out to be useful for some application patterns as we’ll discuss later. Kafka utilizes this event history to enable “stream” based delivery of events, where the consumer is less concerned about individual events and happy for acknowledgements to be handled asynchronously. This is what enables Kafka to satisfy high consumption rates, especially where numbers of subscribers are high (e.g. 100s and up). However, it also means that ambiguity is pushed to the application to resolve, resulting in more complex code on the consumer side.
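
The contrast between a destructive read and a retained event log can be sketched in a few lines of plain Python. This is a toy model to illustrate the semantics, not either product’s actual API:

```python
from collections import deque

class Queue:
    """Messaging-style: a read is destructive, so each message is seen once."""
    def __init__(self):
        self._messages = deque()

    def put(self, msg):
        self._messages.append(msg)

    def get(self):
        # Reading removes the message; there is no history to revisit.
        return self._messages.popleft() if self._messages else None

class Log:
    """Kafka-style: events are appended to a retained history; each consumer
    tracks its own offset and can rewind to re-read past events."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read(self, offset):
        # Reading is non-destructive: any consumer can read from any offset.
        return self._events[offset:]

q = Queue()
q.put("order-1")
first = q.get()    # "order-1" — and it is now gone from the queue
gone = q.get()     # None: nothing left to re-read

log = Log()
log.append("order-1")
log.append("order-2")
late = log.read(0)   # a late consumer still sees the full history
again = log.read(0)  # ...and can read it again from any offset
```

The queue pushes delivery bookkeeping onto the server (each read is an immediate, acknowledged removal), while the log pushes it onto the consumer (each consumer must track where it is in the history).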

It is worth noting that it’s not quite as clear cut as I’ve described above. For example, messaging does enable processing of messages in batches, and Kafka can be used in a way that provides more granular acknowledgements. However, what we’re trying to articulate here is what the sweet spot is for each technology.

These technologies do overlap too, and most commonly cited is that they both enable the publish/subscribe interaction pattern. For Kafka, this is essentially its only pattern, whereas messaging enables a number of other options. The way they implement publish/subscribe has important differences however, and we’ll discuss the implications of that in a moment.

The need to decouple microservice interactions

Microservice based applications came to popularity around 2014, capitalizing on increasingly lightweight runtimes and the maturing of container technology. The need for strong decoupling between these fine-grained components is one of the key reasons for an increase in interest in asynchronous interaction models.

Microservice architecture is a far-reaching topic in itself, but at its heart is about breaking large monolithic applications up into smaller parts in order to gain benefits such as greater development agility, independent elastic scalability, granular resilience models and more. Microservice architecture is just one perspective on what it means to be cloud native, which is an even broader topic in its own right.

Microservice based applications achieve many of the above-mentioned benefits due to strong decoupling between their individual microservice components. This means separation of logic/code (not trivial), and of data ownership (really hard), but also reducing interactions between components (almost impossible!). The rise of container technology enabled microservice components to run in what are essentially completely independent processes, but unfortunately few if any microservice components can achieve much on their own. No matter how well decoupled the design, they almost always need information from one another, and indeed from outside the application boundary too.

Coupling issues with HTTP based interaction

The simplest way programmatically to retrieve data from a separate component is to do a remote call over the network. Today by far the most common way to achieve this is to use RESTful APIs communicating over the ubiquitous HTTP protocol. There are plenty of other options for specific needs such as performance (e.g. gRPC) or granularity (e.g. GraphQL), and of course the older SOAP based web services are less favored, but still around. These all fall into the category of synchronous interactions. The caller asks for specific information and waits for the response from the downstream component.

As noted, this is a wonderfully simple programming model, but it comes with one obvious issue. The downstream component must be available, sufficiently performant, and must scale to the needs of these new upstream components. This places a real-time coupling between the calling (upstream) microservice component, and the providing (downstream) microservice component, which is exactly what we were trying to avoid. Worse still, if getting data from systems beyond the boundary of the application we may end up with a dependency on the availability, performance and scalability of an older legacy system. Furthermore, we will be incurring the latency of all the network hops between potentially a chain of microservice and other components. So, it is clear that API interactions result in deep real-time coupling between requestor and provider.

Event driven architecture to the rescue?

It is a common misconception that using an asynchronous transport instead of HTTP will magically remove the coupling between components. Whilst there’s some truth in that, there is a lot of subtlety too.

Asynchronous transports will certainly help if the interaction pattern is fire and forget (the caller does not require a response); asynchronous transports of all types are perfect for this. However, as soon as we drill into the specific requirements, the picture becomes more complex.

Messaging might be preferred when you want to ensure once only delivery and you want to make the programming model as simple as possible for the components consuming the events. Then again, if your primary concern is raw throughput, and your consumers are happy to process batches of events at a time, the event streaming nature of the Kafka interface may prevail.

If you want a group of consumers to share the events between them, processing each event only once across the group, then both technologies enable the idea of a consumer group. However, if you need to be able to scale the number of consuming applications elastically, note that Kafka has limits on the number of consumers in a consumer group based on how its data is partitioned, whereas messaging allows any number of consumers in a group.
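
The partition-based limit on Kafka consumer groups can be made concrete with a toy sketch. This is not the Kafka client API, just an illustration of how partitions are divided across a group:

```python
def assign_partitions(partitions, consumers):
    """Toy partition assignment: spread partitions across group members
    round-robin, Kafka-style. A consumer may end up owning zero partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# A topic with 3 partitions, consumed by a group of 5:
group = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, parts in group.items() if not parts]
# Two consumers own no partitions at all, so scaling the group beyond the
# partition count adds no extra parallelism.
```

This is why elastically scaling Kafka consumers means planning partition counts up front, whereas a messaging queue lets any number of competing consumers pull from it.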

If you have large numbers of components that want to receive all the events, then streaming events through a Kafka topic is well suited to the use case. Then again, if you want to be able to introduce and remove topics in a fairly lightweight fashion, and filter on them using a hierarchical topic tree, then this may be more suited to a messaging implementation.
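
To make the hierarchical topic tree idea concrete, here is a minimal sketch of wildcard matching in the style messaging systems commonly use (MQTT-style conventions: `+` matches one level, `#` matches the remainder). The topic names are invented for illustration:

```python
def topic_matches(pattern, topic):
    """Return True if a '/'-separated topic matches a subscription pattern,
    where '+' matches exactly one level and '#' matches everything after it."""
    p_parts, t_parts = pattern.split("/"), topic.split("/")
    for i, p in enumerate(p_parts):
        if p == "#":
            return True                     # '#' swallows the rest of the topic
        if i >= len(t_parts):
            return False                    # pattern is longer than the topic
        if p != "+" and p != t_parts[i]:
            return False                    # literal level must match exactly
    return len(p_parts) == len(t_parts)     # no unmatched trailing levels

topic_matches("price/stock/+", "price/stock/IBM")   # True: '+' matches one level
topic_matches("price/#", "price/fx/GBP/USD")        # True: '#' matches the rest
topic_matches("price/stock/+", "price/fx/GBP")      # False: 'stock' != 'fx'
```

Subscribers can carve out exactly the slice of the tree they care about, without the topics having been pre-partitioned for them.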

I did say it was subtle, didn’t I?

Moving on from fire and forget interactions, you may want to instantaneously retrieve data, or some other interaction which requires a request/response pattern. In this case no matter what transport you use, using an asynchronous transport will not stop you from being coupled to the component you are calling. Your application will be completely dependent at run time on the availability and performance of the downstream component whilst you wait for the response.

We should recognize that performing a request/response over an asynchronous transport does have some advantages as it can free up resources such as threads and memory. This continues to be a relatively common pattern when looking for those benefits, but it should be noted that this really only works with queue-based messaging, and doesn’t resolve the real-time coupling whilst we wait for the response.
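
The classic mechanics of request/response over queues are a reply-to destination plus a correlation ID. The following is a toy in-memory sketch of that pattern (queue names and message fields are invented, not any product’s API):

```python
import uuid
from collections import defaultdict, deque

queues = defaultdict(deque)   # toy stand-in for a set of named queues

def request(service_queue, body):
    """Publish a request carrying a reply-to queue and a correlation ID,
    then return the ID so the caller can match the eventual reply."""
    corr_id = str(uuid.uuid4())
    queues[service_queue].append(
        {"replyTo": "replies", "corrId": corr_id, "body": body})
    return corr_id

def serve(service_queue):
    """Downstream service: consume one request, put the response on the
    reply-to queue, echoing the correlation ID."""
    msg = queues[service_queue].popleft()
    queues[msg["replyTo"]].append(
        {"corrId": msg["corrId"], "body": msg["body"].upper()})

corr = request("quotes", "ibm")
serve("quotes")                        # happens asynchronously in reality
reply = queues["replies"].popleft()
assert reply["corrId"] == corr        # correlate the reply with its request
```

The requester’s thread is free between `request` and collecting the reply, which is the resource benefit mentioned above; but the requester still cannot make progress until the downstream service actually responds.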

Furthermore, the request/response pattern over publish/subscribe topics is not a good fit. Since Kafka only provides a publish/subscribe pattern, for most use cases it is considered an anti-pattern to perform request/response using Kafka. So if you have a need for request response over an asynchronous transport, a message queuing solution should be used.

In summary, asynchronous transport does enable some decoupling, but you have to closely inspect your interaction patterns to know how and whether you will benefit.

Introducing event projections

There is an alternative pattern with asynchronous interaction, based on a form of caching, that can provide runtime decoupling at the cost of additional complexity in the overall design and operational topology.

If we can publish every change that occurs to the data we are interested in as an “event”, then this can be used to re-construct the back end data. Our microservice can listen to these events in advance of needing the data and build up a cache (often termed a projection in this context) either in memory, or perhaps in a local data store. Then, when a request comes in that needs that data, we can serve it immediately from the projection with minimal latency. There is no coupling to the availability and performance of the component that actually owns the data, and we can store the data in exactly the form we want, and ensure it is architected for our scaling and availability needs.

This caching pattern, sometimes referred to as an “event projection”, is described in much more detail in a separate paper along with its relationship to other patterns such as CQRS and event sourcing. It of course comes with all the well-known issues associated with caching, due to the second copy of master data. It is clearly a much more complex design, with more moving parts, and we have to consider data currency, eventual consistency, pre-population, cache invalidation, data transmission and storage costs, and so on. However, it does remove our runtime dependency and is therefore a popular option in microservices architecture when independence is required between components that have an interest in the same data.

Event projections could be built using any asynchronous transport. However, there are some possible advantages to implementation of this pattern using a capability such as Kafka which is tuned to streaming data and that has an event history as it enables the consumer to re-build the local data store at will. We might for example want to change the way the data is stored, and perhaps even what it is stored in, and we could use the event history to rebuild the data in the new model or store. The local data store could also be held just in memory for efficiency (rather than persisted to disk) since it could be rebuilt from the history if required.
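
The core of an event projection is just a fold over the change history into a local, read-optimized view, which can be rebuilt from offset zero whenever needed. A minimal sketch, with invented event shapes and keys:

```python
# A retained history of change events, as it might sit in a Kafka topic.
history = [
    {"op": "upsert", "key": "acct-1", "balance": 100},
    {"op": "upsert", "key": "acct-2", "balance": 50},
    {"op": "upsert", "key": "acct-1", "balance": 75},   # later change wins
    {"op": "delete", "key": "acct-2"},
]

def build_projection(events):
    """Fold change events into a local key/value view holding latest state."""
    projection = {}
    for e in events:
        if e["op"] == "upsert":
            projection[e["key"]] = e["balance"]
        else:
            projection.pop(e["key"], None)
    return projection

view = build_projection(history)        # {"acct-1": 75}: latest state only
# If the in-memory view is lost, or we change the storage model, we simply
# replay the history rather than going back to the data's owner.
rebuilt = build_projection(history)
```

Note that the same replay that rebuilds the view is what lets us change its shape later: a new `build_projection` over the same history yields a view in the new model, with no call to the owning system.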

Of course, use of the event history in this way relies on there being enough history present in Kafka. The use case needs careful consideration to ensure we understand the profile of the data. How big could the history get, how long will it take to re-build from it, can we archive or compact some of it, do we need to consider data retention policies and so on, and fundamentally, are we comfortable with the increased complexity this adds to our solution.

Choosing the technology based on the use case

We have walked through a few examples where the detail of the requirement becomes particularly important when choosing technology. Let’s take a step back and attempt to summarize more broadly the applicability of each technology across different use cases.

Messaging is ideal if you want to request that a particular action is routed asynchronously to a particular application, wherever it happens to have moved to on the messaging network, and ensure it is performed once and only once. One of the greatest benefits of the “destructive read” behavior of messaging is that it makes it easier to distribute work across a dynamically scalable number of consumer instances (as opposed to distributing across a fixed small consumer group, or indeed broadcasting all messages to all consumers). Another key point is that messaging is less likely to get caught up in discussions around data retention policies, and may well be easier if granular security policies are required. Messaging implementations such as IBM MQ also enable deeper transactionality and assure once only delivery in a way that is arguably more straightforward for the consumer. Finally, as we’ve mentioned already, in addition to fire/forget interactions, messaging is also suited to request/response patterns in a way that Kafka specifically is not.

In comparison, if you want to be able to broadcast high volumes of events, sending all events to a large number of subscribers, then Kafka is well suited to that. If you want to pour huge amounts of data down a firehose into an analytics engine or more generally where batching of events is preferable, then this aligns with Kafka’s streaming capabilities. If you want to perform patterns that need to revisit past events such as event projections, or event processing then Kafka’s event history may help with that.

The above few paragraphs unashamedly race through some very subtle and important differences between the two technologies. My intention here is not to provide a detailed comparative guide as these are documented already, but to drive home the point that they are very different technologies. Each has sweet spots and for some use cases it really matters which one you pick.

Perhaps we shouldn’t be comparing these technologies at all?

If they are so different, why are we even comparing them in the first place? Comparisons on an infrastructure level quickly show that messaging technologies were designed with a very different scope to Kafka. Messaging can be deployed in a highly granular fashion, because it has few if any dependencies. You can stand up an individual production grade queue on a fraction of a core. Its job is purely to provide a place to store messages whilst in transit from one application to another. It can of course scale to highly sophisticated messaging networks, but the starting footprint for a server is deliberately small.

Kafka on the other hand has a relatively large base installation. It is designed to provide a shared “event backbone” service to many providers and consumers, offering longer term storage of data and schema registries, and requiring re-balancing tooling (e.g. Apache Cruise Control) and more. For this it requires a larger starting footprint.

Perhaps we shouldn’t be comparing them at all. I’ve sometimes distinguished the two by saying that where queue-based messaging provides an asynchronous transport, Kafka is more of an alternative type of data store. There are clearly limitations in that analogy, but I think it does help to separate the primary purpose of each technology. In reality we often see these technologies used in combination, each fulfilling the patterns that they are designed for.

Resurgence, or revolution?

There are ever growing reasons to introduce further asynchronous interaction patterns. The proliferation of increasingly distributed (microservice) applications. Applications that used to live close to one another may now be distributed across multiple different cloud platforms. Enterprises are reducing their operational workload by outsourcing applications to software as a service vendors. Interactions extend to billions of smartphones, and an uncountable number of IoT (internet of things) devices. I could go on.

All of these amount to a more distributed technology landscape crying out for the benefits of decoupling via asynchronous communication. Equally, each of those use cases is unique in its requirements. Some will be obsessing on throughput, others on large numbers of consumers, others on footprint, or power consumption, or geographical spread, or data integrity, or security.

Are we just seeing a resurgence of concepts that have been around for decades, or is something new afoot? The answer of course is both, as I’ve tried to explore in this post. We are seeing new patterns, but we are also seeing new uses for old patterns. As a result, we still need to pay close attention to the requirements before choosing the technology. There is no one-size-fits-all, and there never will be.

One thing is for sure. The number and style of interactions will only increase. We’re going to have to think hard about how we ensure we can manage and maintain these increasingly complex integration landscapes, but that’s for another time.

Acknowledgements: Thanks (alphabetically on surname!) to Carsten Bornert, Alan Chatt, Callum Jackson, Dale Lane, David Ware for their sanity checking and insights.


Kim Clark

Integration focused architect. Writes and presents regularly on integration architecture and design topics.