A quick overview of Apache Kafka, its benefits and drawbacks, and a set of use cases where Kafka shines.
A little bit of background
The Apache Kafka architecture
Apache Kafka is a distributed streaming platform, originally developed by Linkedin and subsequently open sourced in 2011. It is able to manage logs, called topics. A topic is an ordered, append-only collection of events (records) stored in a fault-tolerant and durable way: records are written to disk and replicated (1). Records are appended to the end of the log, and reads proceed left-to-right; each entry is assigned a unique sequential log entry number.
Events are written by producers. They serialize and publish messages to one or more Kafka topics. They can also compress and load balance events among brokers through partitioning. There are many kinds of producers: components of an application, IoT devices, monitoring agents.
Events are read by consumers. They read messages from the topics which they subscribe. Often, consumers belong to a consumer group: each consumer within a certain consumer group will be responsible for reading a subset of partitions of each topic that is subscribed to. Very often, the same entities can act as both consumers and producers.
Kafka acts, in a certain way, like the middle man between applications that generate events and application that consume events.
Kafka is designed to manage heavy applications and queue a large number of messages which are kept inside a topic. To ensure scalability, each topic is divided into multiple partitions and each partition in managed on a separate broker. To ensure fault-tolerance and highly-availability, every topic can be replicated, so that there are always multiple brokers with a copy of the data in case of broker failure.
Pros and Cons of Apache Kafka
Since it’s important to know the limitations of every technology, here is a list of Pros and Cons of Kafka.
High-throughput and Scalability: Kafka can process millions of messages per second and is able to shard topics into partitions for high scalability and high-performance sequential reads/writes. With multiple producers writing to the same topic via separate replicated partitions, and multiple consumers from multiple consumer groups reading from separate partitions, it’s possible to reach just about any level of desired scalability and performance;
Low latency: in the range of milliseconds. This feature allows to manage real-time events;
Fault-tolerant: thanks to topic replication (multiple copies of the data), a Kafka cluster is able to automatically recover from a failure of a broker;
Long term data storage: Kafka is different from traditional message queues (like RabbitMQ). Kafka retains the message after it was consumed for a period of time (default is 7 days), while RabbitMQ removes messages immediately after the consumer’s confirmation was received. Also, it’s not uncommon to have topics holding terabytes of data.
Security: Kafka provides a number of enterprise-grade security features for both authentication and authorization (Kerberos, TLS, Access Control Lists)
Kafka clients: Kafka is written in Java and Scala, but the list of programming languages that provide Kafka client libraries is big (e.g.: Clojure, C#, Go, Groovy, Kotlin, Node.js, Python, Ruby). There are also additional components of the Kafka platform which help interoperability: REST Proxy, Kafka Connect.
Increased operational complexity;
Incomplete set of monitoring tools.
Let’s take a look at the most common use cases of Apache Kafka.
Web activity tracking: it’s possible to track all users’ activity on a website (pages visited, products added to cart without a purchase, time spent reading a page, ads displayed to the user). All these information can be published to a dedicated topic and then handled for monitoring, analysis, reports or to adjust in real-time the ads based on performance.
Communication between services: Kafka is an excellent solution in a microservices architecture, improving the overall system efficiency and performance. Microservices use the publish/subscribe mechanism to interact with each other; producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability.
Real-time data processing: a common scenario is credit card fraud detections, by monitoring parameters like the location and frequency of transactions.
IoT data analysis: metrics sent by IoT devices are processed in real time and at scale. Examples: gps position of cars to perform real-time traffic recommendations; geolocation in mobile apps to provide context-specific selling promotions; engines and brakes sensors installed on public trains to detect anomalies.