Apache Kafka - V - Consumer Groups

A single consumer, receiving and processing messages in a single thread, can quickly become a bottleneck, especially as producers, topics and partitions grow in number. Kafka overcomes this limitation by letting consumers share the load as evenly as possible: by setting the group.id option on creation, a consumer joins a consumer group, and the group's members split the partitions of the subscribed topics among themselves.
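
As a minimal sketch, a consumer joins a group simply by setting group.id before subscribing. The broker address, group name and topic below are placeholders:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupedConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
            props.put("group.id", "orders-processors");         // hypothetical group name
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));           // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }

Every instance of this program started with the same group.id becomes another member of the group, and the topic's partitions are shared among them.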

When a bunch of consumers set the same group id, a broker inside the cluster is automatically assigned as the group coordinator: its job is to monitor the consumers and keep track of their membership. The group coordinator also communicates with Zookeeper and the cluster controller/leader node in order to assign partitions to each consumer in the group. The ideal situation is to have consumers and partitions evenly matched, i.e. each consumer reads exactly one partition of the topic.

Each consumer regularly sends a heartbeat to the cluster (with a frequency defined by the heartbeat.interval.ms consumer setting). If the coordinator hasn't received a heartbeat from a consumer after session.timeout.ms milliseconds, the consumer will be considered dead, and this will trigger a rebalance: the coordinator removes the dead consumer from the group and assigns its partition(s) to some other active consumer. The consumer with the increased load will take over from the last committed offset (or apply another reset policy, if such is the case). If there was a gap between the failed consumer's current position and its last committed offset, there might be duplicate records (already processed by the failed consumer and reprocessed by its successor). Whether this is a problem will depend on the application. If it is, it might be necessary to commit offsets manually and synchronously.
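
A rough sketch of that approach, reusing the hypothetical consumer configuration from the first snippet: auto-commit is disabled, and offsets are committed only after the records have actually been processed.

    props.put("enable.auto.commit", "false");                  // take over offset management

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(List.of("orders"));                  // hypothetical topic
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record);                                // hypothetical processing step
            }
            if (!records.isEmpty()) {
                consumer.commitSync();                          // blocks until the offsets are stored
            }
        }
    }

A record is then only marked as consumed once processing has finished; after a crash the successor may still reprocess the last uncommitted batch, but the window for duplicates is much smaller.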

If a new consumer joins the group, or a partition is added to the topic, the rebalance protocol will be triggered as well. In any case, it will always attempt to keep the consumer-to-partition assignment as close to 1:1 as possible. For higher reliability, some applications may have more consumers than partitions, so that the spare consumers can take over from failed ones and keep servicing the same number of partitions until the failed consumers recover. (If there are more consumers than partitions, the extra consumers will remain idle until they are needed.)

Advanced consumer settings

These will only be mentioned and briefly described. They are worth further exploration.

  • Consumer performance
    • fetch.min.bytes: Minimum amount of data the broker should return for a fetch request; if less is available, the broker waits a bit before replying. This is analogous to the producer's batch size.
    • fetch.max.wait.ms: Maximum time to wait if fetch.min.bytes is not yet satisfied.
    • max.partition.fetch.bytes: Maximum amount of data returned per partition, to avoid fetching more than the processing loop can handle at once.
    • max.poll.records: Similar to the previous one, but measured in number of records returned by a single poll().
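
As an illustration only (the values below are arbitrary, not recommendations), all four are ordinary consumer properties:

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");      // assumed local broker
    props.put("group.id", "orders-processors");            // hypothetical group name
    props.put("fetch.min.bytes", "1024");                  // wait for at least 1 KB per fetch...
    props.put("fetch.max.wait.ms", "500");                 // ...but no longer than 500 ms
    props.put("max.partition.fetch.bytes", "1048576");     // at most 1 MB per partition per fetch
    props.put("max.poll.records", "100");                  // at most 100 records per poll()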

Advanced consumer subjects not covered

  • Consumer position control: Using the seek() method, it is possible to jump to a specific offset within a partition and re-read messages from there. With seekToBeginning(), the consumer can rewind to the start of its partitions, and there's also seekToEnd() for skipping straight to the latest messages (a sketch follows this list).
  • Flow control: This allows pausing and resuming topics and partitions, if the consumer decides to dynamically focus on specific topics for a while and resume the paused ones later. The two API functions are, predictably:
    • pause
    • resume
  • Rebalance listeners: These allow a consumer to be notified when a rebalance takes place, so that it may adjust its behavior if necessary (the sketch after this list registers one).
  • Kafka Schema Registry (by Confluent): This is complex enough for its own series of posts, but the motivation behind this addition to Kafka is to handle producer-consumer data format contracts by defining schemas for each, with version management included.
  • Kafka Connect: Another part of the Kafka ecosystem which deals with integrating common data sources: relational databases, Hadoop, file systems, and many more. These "connectors" save developers from writing consumers for each data source over and over.
  • Kafka Streams: Yet another addition to Kafka which leverages existing Kafka infrastructure to implement real time, stream-based processing. This is ideal for big data scenarios.
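
To give a flavour of the first and third items, here is a rough sketch (broker address, group name and topic are again placeholders) of a rebalance listener that logs reassignments and rewinds newly assigned partitions to their beginning with seekToBeginning():

    import java.time.Duration;
    import java.util.Collection;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class RewindingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");    // assumed local broker
            props.put("group.id", "auditors");                    // hypothetical group name
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {   // hypothetical topic
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    System.out.println("Revoked: " + partitions);  // last chance to commit or clean up
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Assigned: " + partitions);
                    consumer.seekToBeginning(partitions);          // rewind the new assignment to offset 0
                }
            });

            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(rec ->
                        System.out.printf("offset=%d value=%s%n", rec.offset(), rec.value()));
            }
        }
    }

pause() and resume() follow the same pattern: both take a collection of TopicPartition objects, typically obtained from consumer.assignment().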
