Kafka Overview: Architecture, Components, and Internals
An overview of Apache Kafka, covering its architecture, core components, and internal mechanisms based on Kafka 2.4.
Introduction
Apache Kafka is a distributed streaming platform. It can be used both as a traditional message broker for publish–subscribe systems and as a stream storage and processing system for large-scale data pipelines.
Thanks to its distributed design, Kafka provides:
- High throughput
- Fault tolerance
- Horizontal scalability
Typical use cases include:
- Event-driven business systems
- Log aggregation
- Stream processing with big data frameworks (e.g., Kafka + Samza)
This article explains Kafka’s core concepts and internal mechanisms based on Kafka 2.4.
Core Components
From a physical deployment perspective, Kafka consists of the following modules:
- ZooKeeper: stores metadata and provides event notifications.
- Broker: the core Kafka server (implemented in Scala), responsible for handling client requests and persisting message data.
- Clients (Producers & Consumers): implemented in Java, responsible for producing and consuming messages.
Key Concepts
- Broker
- Producer
- Consumer
- Controller
- GroupCoordinator
- TransactionCoordinator
- Topic
- Partition
- Replica
ZooKeeper
Apache ZooKeeper acts as Kafka’s metadata store and coordination service.
It stores information such as:
Cluster Metadata

/cluster/id

{
  "version": 4,
  "id": 1
}

Controller Metadata

/controller

{
  "version": 4,
  "brokerId": 1,
  "timestamp": "2233345666"
}

/controller_epoch

Broker Metadata

/brokers/ids/{id}

{
  "version": 4,
  "host": "localhost",
  "port": 9092,
  "jmx_port": 9999,
  "endpoints": ["CLIENT://host1:9092", "REPLICATION://host1:9093"],
  "rack": "dc1"
}

/brokers/seqid (used to generate broker IDs)

Topic & Partition Metadata

/brokers/topics/{topic}
/brokers/topics/{topic}/partitions/{partition}/state
Consumer Metadata (legacy)
/consumers/{group}/offsets/{topic}/{partition}
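To make these znodes concrete, the controller metadata can be read directly with the plain ZooKeeper Java client. The following is a minimal sketch; the connection string and session timeout are illustrative values, not anything Kafka itself mandates.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ReadControllerZnode {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper (address and timeout are illustrative values)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        Stat stat = new Stat();
        // Fetch the JSON payload of the /controller znode shown above
        byte[] data = zk.getData("/controller", false, stat);
        System.out.println(new String(data));
        zk.close();
    }
}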
Broker Startup Process
When a Kafka broker starts, it performs the following steps:
- Initialize ZooKeeper client and create root nodes
- Create or retrieve the Cluster ID
- Load local broker metadata
- Generate or read broker ID
- Start the Replica Manager and related background tasks
- Register broker information in ZooKeeper
- Start the Controller election
- Initialize GroupCoordinator
- Initialize TransactionCoordinator
- Start request handling services
Only one broker in the cluster acts as the Controller at any time; it is responsible for:
- Broker lifecycle events
- Leader election
- Topic creation and deletion (see the example below)
- Partition reassignment
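As an illustration of the controller's topic-management role, a client can request topic creation through the AdminClient API; the controller then assigns replicas and elects partition leaders for the new topic. The following is a minimal sketch in which the topic name, partition count, and replication factor are illustrative values.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        // Request a topic with 3 partitions and replication factor 2 (illustrative values);
        // the controller assigns the replicas and elects the partition leaders.
        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(Collections.singleton(new NewTopic("my-topic", 3, (short) 2)))
                 .all()
                 .get();
        }
    }
}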
Producer
Example
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

// Create the producer, send one record asynchronously, and close (close flushes pending sends)
Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("my-topic", "key", "value"));
producer.close();

Key Characteristics
- Thread-safe
- Uses internal buffers for batching
- Supports retries and acknowledgements
- acks determines durability guarantees (see the example below)
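Continuing the producer example above, send() also accepts a completion callback, which is where retries and acknowledgements become visible to the application; with acks=all the callback fires only after the in-sync replicas have acknowledged the write. A minimal sketch:

// Asynchronous send with a callback: metadata is non-null on success,
// exception is non-null if delivery failed even after retries.
producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
    if (exception != null) {
        exception.printStackTrace();
    } else {
        System.out.println("partition=" + metadata.partition() + ", offset=" + metadata.offset());
    }
});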
Consumer
Example
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("my-topic"));

// Poll in a loop; each poll returns the records fetched since the previous call
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.value());
    }
}

Key Characteristics
- Not thread-safe
- Supports consumer groups
- Automatic or manual offset management (see the example below)
- Rebalancing on membership changes
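For manual offset management, automatic commits can be disabled and offsets committed explicitly once a batch has been processed. The following is a minimal sketch that reuses the consumer properties from the example above; enable.auto.commit is the only change.

props.put("enable.auto.commit", "false");   // disable automatic offset commits

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("my-topic"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record.value());
    }
    if (!records.isEmpty()) {
        consumer.commitSync();   // commit only after the batch has been processed
    }
}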
Summary
Kafka is a powerful and flexible platform for both messaging and stream processing.
Key takeaways:
- Distributed by design
- High throughput and fault tolerance
- Strong ecosystem for stream processing
- Requires understanding of internals for effective use
For deeper understanding, consider reading Kafka’s source code and official documentation.