By early 2015 we had both Chronicle Queue and Kafka in production — Chronicle for intra-day trade journaling, Kafka for end-of-day data pipelines. The question came up repeatedly: why not use one for both? The answer is that they solve different problems with incompatible design priorities.
Design Philosophy
| Dimension | Chronicle Queue | Apache Kafka |
|---|---|---|
| Primary goal | Ultra-low latency, single machine | High-throughput, distributed, durable |
| Deployment | Library in your JVM process | Separate cluster (broker + ZooKeeper) |
| Persistence | Memory-mapped files, local disk | Replicated log, multiple brokers |
| Consumers | Unlimited readers, zero coordination | Consumer groups, offset tracking |
| Replication | None (single machine) | Configurable (default 1, typically 3) |
| Ordering | Strict global order | Strict per-partition order |
| Latency | 20–200ns write, <1µs read | 1–10ms end-to-end (acks=all) |
| Throughput | ~100M msgs/s (local memory) | ~1M msgs/s (network-bound) |
| Operational cost | Zero (embedded) | High (cluster management) |
These are not competing products in the same category. They’re tools with overlapping capabilities and very different tradeoffs.
Chronicle Queue: The Local Fast Lane
Chronicle Queue is an embedded library — no network, no broker, no separate process. You write to it and read from it within the same machine (or across machines via shared NFS/NAS, though that’s uncommon).
The write path:
```java
// appender obtained via queue.acquireAppender() on an open ChronicleQueue
appender.writeDocument(w ->
    w.write("price").float64(1.08451)
     .write("qty").int64(1_000_000));
```
Internally: acquire the next position in the memory-mapped file, write the binary-encoded bytes directly into mapped memory via Unsafe.putLong(). No syscall. No serialisation intermediate buffer. The OS page cache absorbs the write; the data is immediately readable by any tailer on the same machine.
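Chronicle's internals aside, the core mechanism can be sketched with plain JDK memory-mapped buffers. This is a simplified illustration of the technique, not Chronicle's actual code: one `mmap` syscall up front, then every write is a plain memory store.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of the memory-mapped write path, reduced to plain JDK APIs:
 *  map a file region once, then write and read with no per-write syscall. */
public class MmapWriteSketch {
    // Writes a price/qty pair into the mapping and reads the qty back.
    static long writeAndReadQty() throws IOException {
        Path file = Files.createTempFile("journal", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // One mmap syscall here; every put/get below is a plain memory
            // store or load, absorbed by the OS page cache.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putDouble(0, 1.08451);   // price
            buf.putLong(8, 1_000_000L);  // qty
            // Any reader mapping the same file sees this data immediately.
            return buf.getLong(8);
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndReadQty());
    }
}
```

Chronicle adds binary encoding, append-position coordination, and lock-free multi-writer support on top of this primitive, but the zero-syscall write is the same idea.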
Write latency distribution (measured, Xeon E5-2687W, NVMe SSD):
```
p50:   45ns
p99:   180ns
p999:  2,100ns   ← occasional OS page fault or scheduler preemption
p9999: 18,000ns  ← rare, usually GC or OS flush
```
The p999 and p9999 spike from the OS occasionally needing to fault in a new memory-mapped page or flush the page cache. These are manageable — pre-touching pages and configuring vm.dirty_* kernel parameters reduces them significantly.
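Those `vm.dirty_*` knobs are standard Linux sysctls controlling page-cache writeback. The values below are only illustrative; tuning is workload- and hardware-dependent:

```
# /etc/sysctl.d/99-journal.conf — illustrative values, not a recommendation
vm.dirty_background_ratio = 5     # start background writeback earlier
vm.dirty_ratio = 10               # cap dirty pages before writers block
vm.dirty_expire_centisecs = 3000  # flush dirty pages after ~30s
```

Lower thresholds trade steadier write latency for more frequent (smaller) flushes, which is usually the right trade for a latency-sensitive journal.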
Chronicle Queue is the right choice when:
- Latency budget is sub-millisecond and the log is intra-process or intra-machine
- You need a durable ordered log without a network hop
- Single-machine reliability is acceptable (no broker replication)
- The data volume fits on local disk
For us: trade event journaling, order state persistence, price history for the current trading day. If the machine died, we’d recover from the exchange’s records. The local journal was for operational use — replaying the day’s events to debug an issue, feeding the risk engine’s position reconstruction.
Kafka: The Distributed Pipeline
Kafka is a distributed log: write to a broker, the broker replicates to followers, consumers pull from any broker in the cluster. It’s designed for scenarios where the producer and consumer don’t share a machine, where you need multiple independent consumers, where data must survive broker failure, and where throughput is more important than nanosecond latency.
Write latency with acks=all (all replicas must acknowledge):
```
p50:  4ms
p99:  18ms
p999: 95ms
```
That’s 4ms at median versus 45ns for Chronicle, nearly five orders of magnitude. For intra-day trade journaling with a sub-millisecond SLA, Kafka is not an option.
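For reference, the producer settings behind durability-first numbers like these look roughly as follows (illustrative values, not our production configuration):

```properties
# producer.properties — durability over latency (illustrative)
acks=all                  # wait for all in-sync replicas to acknowledge
enable.idempotence=true   # avoid duplicates on retry
linger.ms=5               # small batching window; raise for more throughput
# topic/broker side, assumed: replication.factor=3, min.insync.replicas=2
```

Relaxing `acks` to `1` cuts median latency but means an acknowledged write can be lost if the leader dies before replication; that trade is rarely acceptable for trade data.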
Kafka is the right choice when:
- Data must survive the producing machine dying (replication)
- Multiple teams/systems consume the same data independently
- Producer and consumer are on different machines (network is unavoidable)
- You need > 1 physical machine worth of storage
- Latency tolerance is milliseconds, not microseconds
For us: end-of-day risk reporting (trades from all systems → Kafka → risk analytics), market data archival (prices → Kafka → HDFS), regulatory reporting pipeline.
The Pattern We Used
```
Intra-day (latency-sensitive):

  Trade events → Chronicle Queue ──┬── Risk engine (tailer)
                                   ├── Debug replay (tailer)
                                   └── Order tracking (tailer)

End-of-day (throughput, durability):

  Chronicle Queue → bridge (async) → Kafka Producer → Kafka
```
At market close, a bridge process read the day’s Chronicle Queue and published everything to Kafka. The bridge ran asynchronously and didn’t affect the intra-day latency path. Kafka retained the data for 30 days for regulatory purposes and fed the downstream analytics systems.
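The bridge is essentially a resumable replay loop. Here is a stand-in sketch using a line-per-event file in place of the Chronicle Queue and a callback in place of the Kafka producer; the class and method names are illustrative, not from our codebase:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;

/** Simplified end-of-day bridge: replay a local journal into a publisher.
 *  A plain line-per-event file stands in for the Chronicle Queue, and
 *  `publish` stands in for a Kafka producer send. */
public class EodBridge {
    // Replays every event at index >= fromOffset, then returns the new
    // offset so a restarted bridge resumes where it left off instead of
    // re-publishing the whole day.
    static long replay(Path journal, long fromOffset, Consumer<String> publish)
            throws IOException {
        List<String> events = Files.readAllLines(journal);
        for (long i = fromOffset; i < events.size(); i++) {
            publish.accept(events.get((int) i));
        }
        return events.size();
    }
}
```

The real bridge streamed via a Chronicle tailer rather than reading the whole file, but the shape is the same: an offset-tracked loop that can crash and resume without duplicating or dropping events (given idempotent publishing downstream).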
What Chronicle Queue Can’t Do
- Multi-machine distribution: data lives on local disk. No broker, no replication. If the disk fails, data is gone.
- Multiple independent consumer offsets: Chronicle tailers share the physical file and each maintain their own read position. This works, but it’s not the managed consumer-group model Kafka provides.
- Topic-based routing: one Chronicle Queue is one ordered log. Kafka’s topic/partition model provides more flexible routing.
- Long-term retention at scale: Chronicle files grow until you delete them. At 100M messages/day, you need active management. Kafka’s retention policies handle this automatically with log compaction and segment deletion.
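The Kafka retention policies referred to above are per-topic settings. An illustrative 30-day configuration (matching the retention window described earlier; values are examples, not our exact config):

```properties
# topic-level retention (illustrative)
retention.ms=2592000000   # keep data for 30 days
cleanup.policy=delete     # drop expired segments (use 'compact' to keep latest per key)
segment.ms=86400000       # roll a new segment daily so expiry is granular
```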
What Kafka Can’t Do
- Sub-millisecond write latency: network round-trips and replication are irreducible.
- Zero-copy local read: Chronicle’s memory-mapped access is faster than anything that crosses a network.
- Embedded deployment: Kafka requires a cluster; Chronicle is a library dependency.
- Deterministic tail latency: Kafka’s p999 is dominated by network and replication jitter.
The tools are complementary. Using Kafka for everything sacrifices the latency properties that make the intra-day path work; using Chronicle for everything sacrifices the durability and distribution properties that make the end-of-day pipeline reliable. Running both, with a bridge, gives you the right tool for each context.