CCDAK Cheatsheet — Kafka Producers/Consumers, Offsets, Semantics & Schema Basics

Comprehensive CCDAK quick reference: Kafka architecture mental models, producer/consumer configs, consumer groups and rebalancing, offset management, delivery semantics (idempotence/transactions), and schema evolution basics.

Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to validate speed/accuracy.


1) Kafka mental models (the exam is testing these)

The unit-of-ordering rule

Kafka preserves order only within a single partition.
If you need ordering for a key/entity, ensure all its records route to the same partition (usually by key).
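
A minimal sketch of why a stable key yields per-key ordering: the partition is a deterministic function of the key bytes. The class name `KeyRouting` and the hash function are illustrative assumptions; Kafka's default partitioner hashes the serialized key with murmur2, not `Arrays.hashCode`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: Kafka's default partitioner uses a murmur2 hash of the key bytes.
final class KeyRouting {
    // Deterministically maps a key to a partition in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, so its records stay in order.
        int first = partitionFor("customer-123", 6);
        int second = partitionFor("customer-123", 6);
        System.out.println(first == second);
    }
}
```

Because the mapping depends on the partition count, adding partitions later can route an existing key to a different partition, which breaks ordering for that key across the change.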

Record anatomy

| Field | Meaning | Why it matters |
|---|---|---|
| Topic | Logical stream name | Partition count drives parallelism |
| Partition | Sub-stream shard | Ordering + parallelism boundary |
| Offset | Position in a partition log | Consumer progress marker |
| Key | Used for partitioning (default) | Controls ordering affinity |
| Value | Payload | Serialization choice matters |
| Headers | Metadata | Tracing/versioning/hints |
| Timestamp | Create/append time | Useful for time-based processing |

    flowchart LR
      P["Producer"] --> T["Topic"]
      T --> P0["Partition 0: offsets 0..n"]
      T --> P1["Partition 1: offsets 0..n"]
      C0["Consumer"] --> P0
      C1["Consumer"] --> P1

2) Partitions, keys, and throughput (high-yield table)

| You want… | Do this | Why |
|---|---|---|
| Ordering per customer/order/user | Use a stable key (e.g., customer_id) | Same key → same partition |
| Higher parallelism | Increase partitions | More partitions → more consumers can work |
| Avoid hot partitions | Use a better key / custom partitioner | Skewed keys create bottlenecks |
| Predictable consumer scaling | Partitions ≥ max consumer count | One partition can be consumed by only one consumer in a group |

Rule: Within a consumer group, each partition is assigned to at most one consumer, so consumers beyond the partition count sit idle.
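
The rule can be seen in a small in-memory sketch. The round-robin logic and the class name `GroupAssignmentSketch` are assumptions for illustration; this is not one of Kafka's actual assignor implementations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified round-robin assignment; not one of Kafka's real assignors.
final class GroupAssignmentSketch {
    // Assigns partitions 0..numPartitions-1 so that each partition has exactly one owner.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 3 partitions, 4 consumers: the fourth consumer gets an empty assignment.
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3));
    }
}
```

With three partitions and four consumers, `c4` ends up with an empty list. This is why the partition count caps useful parallelism within a group.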


3) Producer essentials (configs you must recognize)

Producer reliability and performance pickers

| Goal | Key settings | Notes |
|---|---|---|
| Lowest latency | Low linger.ms, smaller batches | More requests, higher overhead |
| Highest throughput | Higher linger.ms + larger batch.size | More batching, higher latency |
| Strong durability | acks=all + sufficient replication (min.insync.replicas) | Leader waits for all in-sync replicas (ISR) to acknowledge |
| Safe retries | enable.idempotence=true | Prevents duplicates caused by retries |

Producer configs (what they mean)

| Setting | What it controls | High-yield notes |
|---|---|---|
| acks | Durability acknowledgement | 0 = fire-and-forget; 1 = leader only; all = all in-sync replicas (ISR) |
| retries | Retry attempts | Works with backoff (retry.backoff.ms); avoid tight loops |
| delivery.timeout.ms | Total time for send | Upper bound across retries |
| linger.ms | Batch wait time | Higher → better throughput, higher latency |
| batch.size | Batch capacity (bytes) | Bigger batches can improve throughput |
| compression.type | Compression codec | snappy or lz4 are common choices (the client default is none) |
| max.in.flight.requests.per.connection | Unacknowledged requests per connection | With idempotence it must be ≤ 5; ordering is still preserved across retries |
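
As a rough illustration of how these settings combine, the fragment below builds a throughput-leaning profile that keeps durability and safe retries. The class name and every numeric value are hypothetical starting points, not recommendations; only the property keys come from the table above.

```java
import java.util.Properties;

// Hypothetical profile: values are placeholders to tune per workload.
final class ProducerTuningSketch {
    static Properties throughputProfile() {
        Properties props = new Properties();
        props.put("acks", "all");                   // wait for all in-sync replicas
        props.put("enable.idempotence", "true");    // retries cannot create duplicates
        props.put("linger.ms", "20");               // wait up to 20 ms to fill a batch
        props.put("batch.size", "65536");           // 64 KiB batch buffer per partition
        props.put("compression.type", "lz4");       // compress whole batches
        props.put("delivery.timeout.ms", "120000"); // upper bound on a send, retries included
        return props;
    }
}
```

For a latency-leaning profile, the same keys would move in the opposite direction: a `linger.ms` near 0 and a smaller `batch.size`.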

Java: minimal producer pattern (safe by default)

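    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    Properties props = new Properties();
    props.put("bootstrap.servers", "broker:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("acks", "all");                // wait for all in-sync replicas
    props.put("enable.idempotence", "true"); // retries cannot create duplicates

    // try-with-resources closes the producer even if send() throws
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        // Keyed record: every "customer-123" record goes to the same partition
        producer.send(new ProducerRecord<>("orders", "customer-123", "{\"id\":42}"));
        producer.flush(); // block until buffered records have been sent
    }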

Common traps

  • Retries without idempotence can create duplicates.
  • Keys control partition choice (and therefore ordering). With no key, records are spread across partitions: round-robin in older clients, sticky per-batch assignment in the Java client since Kafka 2.4.

4) Consumer essentials (poll loop + offsets)

Poll loop invariants

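  • Call poll() regularly. In the Java client, heartbeats are sent from a background thread (liveness is bounded by session.timeout.ms), while the gap between poll() calls is bounded by max.poll.interval.ms.
  • Don’t block in processing so long that you exceed max.poll.interval.ms; the consumer then leaves the group and its partitions are reassigned.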

Consumer configs (what they mean)

| Setting | What it controls | Exam cues |
|---|---|---|
| group.id | Consumer group identity | Enables load sharing + offsets per group |
| enable.auto.commit | Auto offset commits | Convenience, less control |
| auto.offset.reset | Start point if no committed offset | earliest vs latest |
| max.poll.records | Max records returned per poll | Batch size for processing |
| session.timeout.ms | Liveness detection | Too low → false rebalances |
| heartbeat.interval.ms | Heartbeat frequency | Must be < session.timeout.ms (typically ≤ 1/3 of it) |
| max.poll.interval.ms | Max time between polls | Long processing → rebalance risk |
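
The sketch below collects these settings into one consumer profile and checks the heartbeat/session relationship. The class name, the helper `heartbeatFitsSession`, and all values are assumptions for illustration; connection and deserializer settings are omitted.

```java
import java.util.Properties;

// Hypothetical profile for a manually committing consumer; values are placeholders.
final class ConsumerConfigSketch {
    static Properties manualCommitProfile(String groupId) {
        Properties props = new Properties();
        props.put("group.id", groupId);
        props.put("enable.auto.commit", "false");    // the application commits offsets itself
        props.put("auto.offset.reset", "earliest");  // used only when no committed offset exists
        props.put("max.poll.records", "200");        // cap on records returned by one poll()
        props.put("session.timeout.ms", "45000");    // missed heartbeats for this long => considered dead
        props.put("heartbeat.interval.ms", "15000"); // one third of the session timeout
        props.put("max.poll.interval.ms", "300000"); // max allowed gap between poll() calls
        return props;
    }

    // True when the heartbeat interval is at most one third of the session timeout.
    static boolean heartbeatFitsSession(Properties props) {
        long session = Long.parseLong(props.getProperty("session.timeout.ms"));
        long heartbeat = Long.parseLong(props.getProperty("heartbeat.interval.ms"));
        return heartbeat * 3 <= session;
    }
}
```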

Java: “manual commit after processing” pattern

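    import java.time.Duration;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // props is assumed to already hold bootstrap.servers, group.id, and the deserializers
    props.put("enable.auto.commit", "false");
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(List.of("orders"));

    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> r : records) {
            process(r); // application-specific handler
        }
        consumer.commitSync(); // commit only after processing succeeds
    }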

Trap: committing before processing → at-most-once behavior.
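
To make the trap concrete, here is an in-memory simulation (no Kafka client involved) of a consumer that crashes once between its two steps and then restarts from the committed offset. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory simulation of commit ordering; all names are hypothetical.
final class CommitOrderSketch {
    // Handles offsets 0..total-1, crashing once between the two steps for offset crashAt.
    // Returns the offsets that were actually processed, in order.
    static List<Integer> run(int total, int crashAt, boolean commitFirst) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;       // offset to resume from after a restart
        boolean crashed = false; // crash only once
        int offset = 0;
        while (offset < total) {
            if (commitFirst) {
                committed = offset + 1; // step 1: commit
            } else {
                processed.add(offset);  // step 1: process
            }
            if (offset == crashAt && !crashed) {
                crashed = true;
                offset = committed;     // restart from the committed offset
                continue;
            }
            if (commitFirst) {
                processed.add(offset);  // step 2: process
            } else {
                committed = offset + 1; // step 2: commit
            }
            offset++;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 2, true));  // commit first: offset 2 is never processed
        System.out.println(run(4, 2, false)); // process first: offset 2 is processed twice
    }
}
```

Committing first yields `[0, 1, 3]` (a lost record); processing first yields `[0, 1, 2, 2, 3]` (a duplicate).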


5) Offset semantics (at-most-once vs at-least-once vs EOS)

| Semantics | How you get it | Risk | Typical pattern |
|---|---|---|---|
| At-most-once | Commit before processing | Lost messages | Rare; only when duplicates are worse than loss |
| At-least-once | Process → commit after | Duplicates possible | Most common; make handlers idempotent |
| Exactly-once (Kafka) | Idempotent producer + transactions + read_committed | More complexity | Used for stream processing / pipelines |

Practical rule: When side effects leave Kafka (databases, external APIs), “exactly-once” end-to-end usually means at-least-once delivery combined with idempotent processing.
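
One way to picture idempotent processing is a handler that remembers which event IDs it has already applied. This is a sketch under assumptions: the class name is hypothetical, events are assumed to carry a unique ID, and a real system would need the seen-ID store to be durable and updated atomically with the side effect.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical handler: redelivered events are detected by ID and skipped.
final class IdempotentHandlerSketch {
    private final Set<String> seenEventIds = new HashSet<>();
    private int appliedCount = 0;

    // Returns true if the event was applied, false if it was a duplicate.
    boolean handle(String eventId) {
        if (!seenEventIds.add(eventId)) {
            return false; // already applied once; redelivery has no further effect
        }
        appliedCount++;   // stand-in for the real side effect
        return true;
    }

    int appliedCount() {
        return appliedCount;
    }
}
```

Feeding it `e1, e2, e2, e3` applies three events even though four deliveries occurred.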


6) Consumer groups and rebalancing (what to do safely)

Rebalance triggers (high level)

  • Consumer joins/leaves the group
  • Partition count changes
  • Session timeout / poll interval violations

Safe rebalance handling

  • On partition revoke: commit processed offsets and clean up local state.
  • On partition assign: initialize any needed state and resume processing.

    sequenceDiagram
      participant B as "Broker (Group Coordinator)"
      participant C1 as "Consumer 1"
      participant C2 as "Consumer 2"

      C1->>B: JoinGroup
      C2->>B: JoinGroup
      B-->>C1: Assign partitions
      B-->>C2: Assign partitions
      Note over C1,C2: Rebalance happens again when membership changes
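
The revoke step can be sketched without a broker. The helper below is hypothetical and only mirrors what a `ConsumerRebalanceListener.onPartitionsRevoked` callback would typically compute before calling `commitSync`; partitions are plain integers here rather than `TopicPartition` objects.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; a real consumer would do this inside its rebalance listener.
final class RevokeCommitSketch {
    private final Map<Integer, Long> lastProcessed = new HashMap<>();

    // Records the offset of the most recently processed record for a partition.
    void markProcessed(int partition, long offset) {
        lastProcessed.put(partition, offset);
    }

    // On revoke: return the offsets to commit and drop local state for those partitions.
    Map<Integer, Long> onPartitionsRevoked(Set<Integer> revoked) {
        Map<Integer, Long> toCommit = new HashMap<>();
        for (int partition : revoked) {
            Long last = lastProcessed.remove(partition);
            if (last != null) {
                toCommit.put(partition, last + 1); // committed offset = next record to read
            }
        }
        return toCommit;
    }
}
```

The committed value is the last processed offset plus one, because a committed offset names the next record the group should read.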

7) Idempotence and transactions (exam-level understanding)

Idempotent producer

Idempotence is about preventing duplicates caused by retries. It helps when the network is unreliable or broker responses are delayed.
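
A simplified model of the mechanism: the broker tracks a sequence number per producer and partition and drops a batch it has already appended. The class below is a hypothetical single-producer, single-partition sketch; the real broker also rejects gaps in the sequence, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of sequence-number deduplication (one producer, one partition).
final class SequenceDedupSketch {
    private final List<String> log = new ArrayList<>();
    private int lastSequence = -1;

    // Appends only if the sequence number is new; returns false for a retried duplicate.
    boolean append(int sequence, String value) {
        if (sequence <= lastSequence) {
            return false; // retry of something already written
        }
        log.add(value);
        lastSequence = sequence;
        return true;
    }

    List<String> log() {
        return log;
    }
}
```

If the producer resends sequence 1 because the first acknowledgement was lost, the log still contains that record once.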

Transactions (EOS building block)

Transactions let you write to multiple partitions/topics atomically and (with consumer settings) avoid reading uncommitted data.

| Concept | What it means | Remember |
|---|---|---|
| Transactional producer | Writes in a transaction | Requires transactional.id |
| read_committed consumer | Reads only committed data | Hides aborted tx records |
| read_uncommitted | Reads everything | Default isolation.level; may see aborted records |
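
The difference between the two isolation levels can be pictured with an in-memory log whose entries are tagged with their transaction outcome. This is a simplified, hypothetical model; real consumers rely on transaction markers written to the partition log.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified in-memory model of isolation levels; all names are hypothetical.
final class IsolationSketch {
    enum TxState { COMMITTED, ABORTED }

    static final class Entry {
        final String value;
        final TxState state;

        Entry(String value, TxState state) {
            this.value = value;
            this.state = state;
        }
    }

    // readCommitted=true hides records from aborted transactions.
    static List<String> read(List<Entry> log, boolean readCommitted) {
        List<String> visible = new ArrayList<>();
        for (Entry e : log) {
            if (!readCommitted || e.state == TxState.COMMITTED) {
                visible.add(e.value);
            }
        }
        return visible;
    }
}
```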

8) Serialization + Schema Registry (high-yield basics)

Serialization chooser

| Format | Pros | Cons | Typical use |
|---|---|---|---|
| JSON | Human-readable, easy | No strict schema; larger payloads | Prototyping, logs |
| Avro | Compact + schema | Requires schema management | Strong default for evolving event contracts |
| Protobuf | Compact + strong types | Tooling complexity | Typed contracts across languages |

Schema evolution mindset (compatibility)

Compatibility answers: “Can old consumers read new data?” and “Can new consumers read old data?”

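| Mode | Safe for… | Mental model |
|---|---|---|
| BACKWARD | New schema reads old data | New consumers OK; upgrade consumers first (Schema Registry default) |
| FORWARD | Old schema reads new data | Old consumers OK; upgrade producers first |
| FULL | Both directions | Safest, strictest |
| NONE | No guarantees | Fast iteration, high risk |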
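
A rough model of the BACKWARD case: a newer reader schema can read older data if every field it added has a default. The sketch below is a simplified stand-in for reader/writer schema resolution and does not use the Avro library; the class name and the map-based schema representation are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Simplified model of schema resolution; not the Avro library.
final class SchemaResolutionSketch {
    // readerFields maps each field name to an optional default value.
    // Returns the record as the reader sees it, or throws if a field has no value and no default.
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Optional<Object>> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Optional<Object>> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                result.put(name, written.get(name));
            } else if (field.getValue().isPresent()) {
                result.put(name, field.getValue().get()); // fall back to the reader's default
            } else {
                throw new IllegalStateException("no value or default for field: " + name);
            }
        }
        return result; // written fields the reader does not declare are ignored
    }
}
```

Reading an old record `{id=42}` with a reader that adds `currency` with default `"USD"` succeeds; adding a field without a default makes the same read fail, which is the kind of change a BACKWARD check rejects.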

9) Troubleshooting quick pickers

  • Consumer lag rising: too few partitions, slow processing, fetch/poll tuning, or downstream bottleneck.
  • Frequent rebalances: session.timeout.ms too low, processing blocks too long (max.poll.interval.ms), unstable membership.
  • Duplicates observed: producer retries (e.g., after timeouts) without idempotence; at-least-once consumption (commit after processing) without idempotent handlers.
  • Out-of-order events: multiple partitions for same key/entity; key missing/changed; multiple topics merged without ordering guarantees.
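
When diagnosing lag, it helps to remember how it is computed: per partition, the log end offset minus the group's committed offset. The helper below is a hypothetical sketch over plain maps; treating a partition with no committed offset as starting from 0 is an assumption made here for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: computes per-partition lag from two offset snapshots.
final class LagSketch {
    // Lag = log end offset - committed offset, never below zero.
    static Map<Integer, Long> lag(Map<Integer, Long> logEndOffsets,
                                  Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> result = new HashMap<>();
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            // Assumption: a partition with no committed offset is treated as committed at 0.
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            result.put(e.getKey(), Math.max(0L, e.getValue() - committed));
        }
        return result;
    }
}
```

Lag that grows on every partition points to slow processing or too few consumers; lag concentrated on one partition points to a hot key.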

10) Mini-glossary

Broker (Kafka server) • Topic (stream name) • Partition (ordered shard) • Offset (position) • Consumer group (parallel readers) • ISR (in-sync replicas) • Rebalance (partition reassignment) • Idempotence (safe retries) • Transaction (atomic writes).