Apache Kafka - I - High level architecture and concepts

First of all, what is Apache Kafka? An initial definition for it is "a high-throughput distributed messaging system". Breaking this down a bit:

  • High throughput: One of Kafka's main goals is to process as many messages per unit of time as possible, in a scalable and fault-tolerant way. For example, LinkedIn, where Kafka originated, needs to sustain throughputs on the order of 20 million messages per second and 3 GB of data per second.
  • Distributed messaging system: In a generic distributed system, networked computers work together for a common goal. In Kafka's case, that goal is to send data from points A1, A2, ... AN, to B1, B2, ... BM. A common benefit of this kind of system, which is crucial to Kafka's success, is horizontal scalability: the ability to handle higher and higher volumes of data by adding nodes to the system (not to be confused with vertical scalability, which implies upgrading the nodes' hardware).

The next question would be: why Kafka? The answer lies in how Kafka addresses and overcomes the shortcomings of the usual data-movement tools and approaches, particularly the issues that surface when data volume and frequency grow beyond a certain threshold. So if your system needs to move large amounts of data at high speed, Kafka might be a good fit for your needs.

The general architecture of a system which uses Kafka to move data is as follows:

[Figure: Kafka's high-level architecture]
  • Producers: Producer applications send messages to a specific topic. In Kafka jargon, a topic is a collection of messages (more on this later); a minimal producer sketch follows this list.
  • Consumers: Consumer applications subscribe to a topic in order to receive the messages published to it; in other words, Kafka implements publish-subscribe semantics. Notice how the arrows in the figure go from the consumers to the cluster: this is because Kafka uses a pull model, where consumers actively poll the cluster for new messages (see the consumer sketch after this list).
  • Kafka cluster: A Kafka cluster is a collection of network-interconnected Kafka brokers, which are responsible for persisting the messages of their corresponding topics and delivering them to consumers on request. One of the brokers assumes the role of controller, distributing and coordinating work among the rest, which act as followers. To achieve fault tolerance, brokers can replicate topics with a configurable replication factor, so that if one broker goes down, the controller can hand its work over to a broker holding a replica until it comes back up (the topic-creation sketch after this list shows a replication factor being set).
  • Zookeeper ensemble: Zookeeper is a distributed system in its own right, whose goal is to manage other systems by saving and tracking their metadata. In Kafka's case, the Zookeeper nodes inside an ensemble (a collection of ZK nodes) work together to track which broker is the current controller, where each broker is on the network, replication factors, bootstrapping configuration for the Kafka cluster, broker health and much more.
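
To make the producer side concrete, here is a minimal sketch using Kafka's Java client. The broker address localhost:9092 and the topic name "events" are illustrative assumptions, not details from this post.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the (hypothetical) topic "events".
            producer.send(new ProducerRecord<>("events", "key-1", "hello kafka"));
        } // close() flushes any buffered messages before returning.
    }
}
```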
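
On the consumer side, a sketch of the pull model described above: the application subscribes to a topic and then repeatedly polls the cluster for new messages. Again, the group id "demo-group" and the topic "events" are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "demo-group");              // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                // The pull model in action: the consumer asks the brokers for new messages.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```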
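
Finally, to illustrate the replication factor mentioned in the broker bullet, a sketch that creates a topic through Kafka's Java AdminClient, asking for each partition to be stored on two different brokers. The topic name and counts are made up for the example.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: each partition lives on two
            // brokers, so losing a single broker loses no data.
            NewTopic topic = new NewTopic("events", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```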

This architecture achieves horizontal scalability by allowing each cluster to grow in size, and also by running multiple clusters. Zookeeper is essential to this end, because it takes care of the complexity of configuring and managing the cluster(s).

In the next post in this series, we'll take a closer look at how messages are structured into topics and partitions.

Go to Part II: Topics and Partitions
