Comprehensive CCDAK quick reference: Kafka architecture mental models, producer/consumer configs, consumer groups and rebalancing, offset management, delivery semantics (idempotence/transactions), and schema evolution basics.
Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to validate speed/accuracy.
Kafka preserves order only within a single partition.
If you need ordering for a key/entity, ensure all its records route to the same partition (usually by key).
| Field | Meaning | Why it matters |
|---|---|---|
| Topic | Logical stream name | Partition count drives parallelism |
| Partition | Sub-stream shard | Ordering + parallelism boundary |
| Offset | Position in a partition log | Consumer progress marker |
| Key | Used for partitioning (default) | Controls ordering affinity |
| Value | Payload | Serialization choice matters |
| Headers | Metadata | Tracing/versioning/hints |
| Timestamp | Create/append time | Useful for time-based processing |
```mermaid
flowchart LR
  P["Producer"] --> T["Topic"]
  T --> P0["Partition 0: offsets 0..n"]
  T --> P1["Partition 1: offsets 0..n"]
  C0["Consumer"] --> P0
  C1["Consumer"] --> P1
```
| You want… | Do this | Why |
|---|---|---|
| Ordering per customer/order/user | Use a stable key (e.g., customer_id) | Same key → same partition |
| Higher parallelism | Increase partitions | More partitions → more consumers can work |
| Avoid hot partitions | Use a better key / custom partitioner | Skewed keys create bottlenecks |
| Predictable consumer scaling | Partitions ≥ max consumer count | One partition can be consumed by only one consumer in a group |
Rule: A consumer group can have at most one consumer per partition (extra consumers idle).
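Kafka's default partitioner hashes the key bytes (murmur2) and takes the result modulo the partition count, which is why a stable key gives ordering affinity. A simplified hash-mod sketch of that idea (not Kafka's actual murmur2 implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyRoutingSketch {
    // Stand-in for the default partitioner: a deterministic hash of the
    // key bytes, mod the partition count. Kafka itself uses
    // toPositive(murmur2(keyBytes)) % numPartitions.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.abs(hash % numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // The same key always routes to the same partition, so per-key
        // ordering holds. Caveat: changing the partition count changes
        // the mapping for existing keys.
        System.out.println(partitionFor("customer-123", partitions)
                == partitionFor("customer-123", partitions));
    }
}
```

Note the caveat in the comment: adding partitions later remaps keys, which breaks per-key ordering across the resize.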
| Goal | Key settings | Notes |
|---|---|---|
| Lowest latency | low linger.ms, smaller batches | More requests, higher overhead |
| Highest throughput | linger.ms + larger batch.size | More batching, higher latency |
| Strong durability | acks=all + sufficient replication | Wait for ISR to ack |
| Safe retries | enable.idempotence=true | Prevent duplicates due to retries |
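As a sketch, a throughput-leaning configuration from the table above might combine batching, lingering, and compression. The values here are illustrative assumptions, not universal defaults; tune them against your own latency budget:

```java
import java.util.Properties;

public class ThroughputProducerConfig {
    public static Properties throughputTuned() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        // Batching: wait up to 20 ms to fill batches of up to 64 KB.
        props.put("linger.ms", "20");
        props.put("batch.size", "65536");
        // Compression trades a little CPU for smaller requests.
        props.put("compression.type", "lz4");
        // Durability and safe retries still apply alongside throughput tuning.
        props.put("acks", "all");
        props.put("enable.idempotence", "true");
        return props;
    }
}
```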
| Setting | What it controls | High-yield notes |
|---|---|---|
| acks | Durability acknowledgement | 0 fire-and-forget; 1 leader only; all waits for ISR |
| retries | Retry attempts | Works with backoff; avoid tight loops |
| delivery.timeout.ms | Total time for a send | Upper bound across retries |
| linger.ms | Batch wait time | Higher → better throughput |
| batch.size | Batch capacity (bytes) | Bigger batches can improve throughput |
| compression.type | Compression | snappy/lz4 are often good defaults |
| max.in.flight.requests.per.connection | In-flight requests | With idempotence, must be ≤ 5 to preserve ordering |
```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("acks", "all");
props.put("enable.idempotence", "true");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("orders", "customer-123", "{\"id\":42}"));
producer.flush();
producer.close();
```
Common trap: not calling poll() frequently enough to satisfy liveness requirements. Heartbeats run on a background thread, but exceeding max.poll.interval.ms still gets the consumer evicted from the group and triggers a rebalance.
| Setting | What it controls | Exam cues |
|---|---|---|
| group.id | Consumer group identity | Enables load sharing + offsets per group |
| enable.auto.commit | Auto offset commits | Convenience, less control |
| auto.offset.reset | Start point if no committed offset | earliest vs latest |
| max.poll.records | Records per poll | Batch size for processing |
| session.timeout.ms | Liveness detection | Too low → false rebalances |
| heartbeat.interval.ms | Heartbeat frequency | Must be < session timeout |
| max.poll.interval.ms | Max time between polls | Long processing → rebalance risk |
```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// assumes props already sets bootstrap.servers, group.id, and deserializers
props.put("enable.auto.commit", "false");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("orders"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> r : records) {
        process(r);
    }
    consumer.commitSync(); // commit only after processing succeeds
}
```
Trap: committing before processing → at-most-once behavior.
| Semantics | How you get it | Risk | Typical pattern |
|---|---|---|---|
| At-most-once | Commit before processing | Lost messages | Rare; only when duplicates are worse than loss |
| At-least-once | Process → commit after | Duplicates possible | Most common; make handlers idempotent |
| Exactly-once (Kafka) | Idempotent producer + transactions + read_committed | More complexity | Used for stream processing / pipelines |
Practical rule: In distributed systems, you often implement “exactly-once” end-to-end by combining at-least-once delivery with idempotent processing.
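A minimal sketch of that pattern: at-least-once delivery may replay a record, so the handler deduplicates on a stable record ID before applying the side effect. Here the seen-set is in memory for illustration; a real system would keep it in a durable store.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IdempotentHandler {
    private final Set<String> seen = new HashSet<>();       // processed record IDs
    private final List<String> applied = new ArrayList<>(); // side effects performed

    // Returns true only the first time a given record ID is processed;
    // redelivered duplicates are silently skipped.
    public boolean handle(String recordId, String payload) {
        if (!seen.add(recordId)) {
            return false; // duplicate delivery: skip the side effect
        }
        applied.add(payload);
        return true;
    }

    public List<String> applied() { return applied; }
}
```

With this in place, a redelivered record is harmless: the second call returns false and the side effect runs exactly once.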
```mermaid
sequenceDiagram
  participant B as "Broker (Group Coordinator)"
  participant C1 as "Consumer 1"
  participant C2 as "Consumer 2"
  C1->>B: JoinGroup
  C2->>B: JoinGroup
  B-->>C1: Assign partitions
  B-->>C2: Assign partitions
  Note over C1,C2: Rebalance happens again when membership changes
```
Idempotence is about preventing duplicates caused by retries. It helps when the network is unreliable or broker responses are delayed.
Transactions let you write to multiple partitions/topics atomically and (with consumer settings) avoid reading uncommitted data.
| Concept | What it means | Remember |
|---|---|---|
| Transactional producer | Writes in a transaction | Requires transactional.id |
| read_committed consumer | Reads only committed data | Hides aborted tx records |
| read_uncommitted | Reads everything | May see aborted records |
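To make the read_committed / read_uncommitted distinction concrete, here is a toy in-memory model (not the Kafka client API): transactional writes are buffered and become visible to a committed reader only after commit, while an uncommitted reader sees in-flight writes immediately.

```java
import java.util.ArrayList;
import java.util.List;

public class TxVisibilityModel {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    public void beginTransaction()  { pending.clear(); }
    public void send(String record) { pending.add(record); }
    public void commitTransaction() { committed.addAll(pending); pending.clear(); }
    public void abortTransaction()  { pending.clear(); } // aborted writes vanish

    // read_committed: only records from committed transactions.
    public List<String> readCommitted() { return new ArrayList<>(committed); }

    // read_uncommitted: committed records plus in-flight writes.
    public List<String> readUncommitted() {
        List<String> all = new ArrayList<>(committed);
        all.addAll(pending);
        return all;
    }
}
```

The real transactional producer follows the same shape (initTransactions, beginTransaction, send, commitTransaction/abortTransaction), with the broker and consumer isolation.level doing the filtering.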
| Format | Pros | Cons | Typical use |
|---|---|---|---|
| JSON | Human-readable, easy | No strict schema; larger payloads | Prototyping, logs |
| Avro | Compact + schema | Requires schema management | Strong default for evolving event contracts |
| Protobuf | Compact + strong types | Tooling complexity | Typed contracts across languages |
Compatibility answers: “Can old consumers read new data?” and “Can new consumers read old data?”
| Mode | Safe for… | Mental model |
|---|---|---|
| BACKWARD | New schema reads old data | New consumers ok |
| FORWARD | Old schema reads new data | Old consumers ok |
| FULL | Both directions | Safest, strictest |
| NONE | No guarantees | Fast iteration, high risk |
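The classic BACKWARD-compatible change is adding a field with a default: a consumer on the new schema can still read old records that lack the field. A toy sketch of that reader-side behavior (the field name and default are invented for illustration, not a real Avro/Schema Registry call):

```java
import java.util.Map;

public class BackwardCompatSketch {
    // New-schema reader: suppose "currency" was added with a default of
    // "USD", so records written before the change still decode cleanly.
    public static String currencyOf(Map<String, String> record) {
        return record.getOrDefault("currency", "USD"); // default fills the gap
    }
}
```

Removing a required field, by contrast, breaks old readers and needs FORWARD (or FULL) thinking before you ship it.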
Common rebalance triggers: session.timeout.ms set too low, processing that blocks longer than max.poll.interval.ms, and unstable group membership.
Glossary: Broker (Kafka server) • Topic (stream name) • Partition (ordered shard) • Offset (position) • Consumer group (parallel readers) • ISR (in-sync replicas) • Rebalance (partition reassignment) • Idempotence (safe retries) • Transaction (atomic writes).