KDB+ is used in risk analytics, trade surveillance, and market data storage across most tier-1 financial institutions. If you work in finance long enough, you will encounter it. Nothing in your Java background prepares you for it.
What KDB+ Is
KDB+ is a column-oriented, in-memory database with a built-in query language called Q. It was designed from the ground up for time-series data, and for that use case it is faster than conventional row-oriented databases by a wide margin.
It’s not open source. It’s not cheap. It runs on a single thread (mostly). It has a syntax that looks like line noise to anyone coming from C-family languages. And it’s the tool of choice for intra-day analytics in finance because nothing else comes close on performance for the specific access patterns involved.
The Q Language: Mental Model Reset Required
Q is an array-oriented functional language. Every operator works on entire arrays, not individual elements. This is the fundamental shift:
/ Java mental model: loop over elements
for (int i = 0; i < prices.length; i++) {
    result[i] = prices[i] * quantities[i];
}
/ Q mental model: operate on entire vectors
result: prices * quantities / element-wise multiply, no loop
A Q function that takes 1ms to process 1 element usually takes 1.1ms to process 1 million elements. The fixed cost is function-call and interpreter setup; the per-element work is SIMD-vectorised, so it is not paid per loop iteration.
Basic types and their Java equivalents:
| Q type | Java equivalent | Notes |
|---|---|---|
| `int` | `int` | 32-bit; null is `0Ni` |
| `long` | `long` | 64-bit; null is `0Nj` |
| `float` / `real` | `double` / `float` | |
| `symbol` | interned `String` | Like an enum: stored as an integer ID |
| `timestamp` | `Instant` | Nanosecond precision; Q epoch is 2000-01-01 |
| `list` | `int[]` (homogeneous) | |
| `table` | column-oriented DataFrame | Key difference from Java |
| `dictionary` | `Map<K,V>` | |
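One conversion in the table above trips up most Java integrations: Q timestamps count nanoseconds from 2000-01-01, not from the Unix epoch. A minimal sketch of the mapping to `Instant` (the class and method names here are mine, not part of any KDB+ API):

```java
import java.time.Instant;

// Q timestamps are nanoseconds since the Q epoch, 2000-01-01T00:00:00Z,
// not the Unix epoch. Convert at the boundary, in both directions.
class QTime {
    static final Instant Q_EPOCH = Instant.parse("2000-01-01T00:00:00Z");

    // Q timestamp (nanos since 2000-01-01) -> java.time.Instant
    static Instant fromQ(long qNanos) {
        return Q_EPOCH.plusNanos(qNanos);
    }

    // Instant -> Q timestamp
    static long toQ(Instant t) {
        long seconds = t.getEpochSecond() - Q_EPOCH.getEpochSecond();
        return seconds * 1_000_000_000L + t.getNano();
    }

    public static void main(String[] args) {
        System.out.println(fromQ(0)); // 2000-01-01T00:00:00Z
        System.out.println(toQ(Instant.parse("2000-01-02T00:00:00Z"))); // 86400000000000
    }
}
```

Forgetting the 30-year offset produces timestamps that are silently wrong, so the conversion belongs in exactly one place at the wire boundary.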
Tables Are Columnar Dictionaries
The Q table is the central abstraction. Unlike SQL tables (rows of records), Q tables are column-oriented by default:
/ A table is a dictionary from column names to column vectors
trades: ([]
time: 09:30:01 09:30:03 09:30:07 09:30:15;
sym: `EURUSD`GBPUSD`EURUSD`EURUSD;
side: `BUY`SELL`BUY`SELL;
price: 1.0842 1.2731 1.0843 1.0841;
qty: 1000000 500000 2000000 750000
)
This is stored as 5 parallel arrays, not as a list of row objects. Querying ``select sum qty from trades where sym=`EURUSD`` reads only the `sym` and `qty` columns; the other 3 are never touched. Exactly the columnar access pattern that makes analytics fast.
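To make the storage model concrete, here is a sketch of the same table as parallel column arrays in Java, with an aggregate that touches only the two columns it needs. All names are mine, and the symbols are pre-interned as ints (0 = EURUSD, 1 = GBPUSD); KDB+ does this in C over SIMD-friendly memory, not in a JVM loop:

```java
class ColumnarScan {
    // Two columns of the trades table as parallel arrays. The time, side,
    // and price columns would be three more arrays, never read below.
    static final int[] sym  = {0, 1, 0, 0};
    static final long[] qty = {1_000_000, 500_000, 2_000_000, 750_000};

    // Equivalent of: select sum qty from trades where sym=`EURUSD
    // Sequential scan of exactly two contiguous arrays.
    static long sumQtyWhereSym(int target) {
        long total = 0;
        for (int i = 0; i < sym.length; i++)
            if (sym[i] == target) total += qty[i];
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumQtyWhereSym(0)); // 3750000
    }
}
```

The row-oriented alternative would pull every field of every trade object through the cache to answer the same question.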
Q-SQL looks familiar but executes differently from SQL:
/ Q-SQL: select with aggregation and grouping
select sum qty, avg price by sym, side from trades
/ With time bucketing (10-minute bars)
select sum qty, last price by sym, 10 xbar time.minute from trades
/ Joining to reference data (symbol table)
select sum qty, first fullName by sym from trades lj `sym xkey ref
The by clause is grouping. xbar is bucketing (round down to nearest multiple). lj is left join on a keyed table. The syntax is terse — Q is designed to fit complex queries on one line.
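xbar is simple enough to pin down exactly: `n xbar x` rounds x down to the nearest multiple of n, which is what turns raw minutes into bar boundaries. A Java equivalent (my own sketch, not a KDB+ API):

```java
class XBar {
    // n xbar x: round x down to the nearest multiple of n.
    // Math.floorMod keeps the round-down behaviour for negative x too.
    static long xbar(long n, long x) {
        return x - Math.floorMod(x, n);
    }

    public static void main(String[] args) {
        System.out.println(xbar(10, 37)); // 30 -- minute 37 falls in the 30-39 bar
        System.out.println(xbar(10, 40)); // 40 -- exact multiples map to themselves
    }
}
```

Grouping by `10 xbar time.minute` is therefore just grouping by this rounded-down key.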
Time-Series Operations
Where Q genuinely has no peer: time-series joins and as-of lookups.
As-of join (aj): join each trade to the most recent quote before that trade’s timestamp:
/ quotes table: timestamp, sym, bid, offer
/ trades table: timestamp, sym, price, qty
/ For each trade, find the quote that was current at trade time
aj[`sym`time; trades; quotes]
/ Result: trades with bid and offer columns added from the most recent matching quote
In SQL this would be a correlated subquery or a LATERAL join — complex, slow, and often requiring a self-join. In Q it’s a primitive operation, implemented in C, operating on sorted column arrays. At billions of rows, it’s still fast.
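Conceptually, once quotes are sorted by time within each symbol, the per-trade lookup that aj performs reduces to a binary search for the last quote at or before the trade's timestamp. A sketch of that core step in Java (names are mine; the real implementation is C operating on whole column arrays):

```java
import java.util.Arrays;

class AsOfLookup {
    // Index of the most recent quote at or before tradeTime, given quote
    // timestamps sorted ascending; -1 if every quote is later than the trade.
    static int asOfIndex(long[] quoteTimes, long tradeTime) {
        int i = Arrays.binarySearch(quoteTimes, tradeTime);
        if (i >= 0) {
            // Exact match: advance past duplicates to the latest equal entry.
            while (i + 1 < quoteTimes.length && quoteTimes[i + 1] == tradeTime) i++;
            return i;
        }
        // Not found: binarySearch returns -(insertionPoint) - 1, so the last
        // quote strictly before tradeTime sits at insertionPoint - 1.
        return -i - 2;
    }

    public static void main(String[] args) {
        long[] quotes = {10, 20, 30};
        System.out.println(asOfIndex(quotes, 25)); // 1 -- quote at t=20 was current
        System.out.println(asOfIndex(quotes, 5));  // -1 -- no quote yet
    }
}
```

The SQL LATERAL-join formulation re-derives this lookup per row from scratch; aj gets it essentially for free from the sort order.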
TWAP / VWAP computation:
/ Volume-Weighted Average Price in 1-minute bars
vwap: select vwap: sum[price*qty]%sum qty by sym, 1 xbar time.minute from trades
The `%` operator is division in Q (modulo is `mod`). This query processes the entire trade history in one pass, with no subqueries and no window-function machinery.
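The same sum[price*qty]%sum qty computation, written out as a single pass over the column arrays in Java (a sketch with my own names, grouping on interned symbol IDs as before):

```java
import java.util.HashMap;
import java.util.Map;

class VwapBySym {
    // VWAP per symbol: sum(price*qty) / sum(qty), accumulated in one pass.
    static Map<Integer, Double> vwap(int[] sym, double[] price, long[] qty) {
        Map<Integer, double[]> acc = new HashMap<>(); // symId -> {notional, volume}
        for (int i = 0; i < sym.length; i++) {
            double[] a = acc.computeIfAbsent(sym[i], k -> new double[2]);
            a[0] += price[i] * qty[i]; // running notional
            a[1] += qty[i];            // running volume
        }
        Map<Integer, Double> out = new HashMap<>();
        acc.forEach((id, a) -> out.put(id, a[0] / a[1]));
        return out;
    }

    public static void main(String[] args) {
        int[] sym = {0, 0, 1};
        double[] price = {100.0, 102.0, 50.0};
        long[] qty = {1, 3, 2};
        System.out.println(vwap(sym, price, qty)); // 0 -> 101.5, 1 -> 50.0
    }
}
```

Adding the time-bucket dimension just means the map key becomes (symbol, bar) instead of symbol alone.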
The Performance Reality
Why KDB+ is fast:
- Column vectors in contiguous memory: cache-friendly sequential reads, SIMD-vectorisable
- Sorted data assumptions: time-series data is typically sorted by time; Q assumes this and uses binary search for time-range filters
- Interns symbols: string comparisons become integer comparisons
- Minimal overhead: the Q interpreter is written in C and K (a more primitive language), with no JVM, no GC, minimal allocation
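The symbol-interning point deserves a concrete picture. A minimal Java sketch of the idea (the class is mine, not anything KDB+ exposes): each distinct string is assigned a small integer ID exactly once, and every later comparison, filter, or group-by works on ints.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Symbol interning: map each distinct string to a dense integer ID so that
// equality checks and group-by keys are integer operations, not char scans.
class SymbolPool {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // Return the existing ID for s, or assign the next one.
    int intern(String s) {
        Integer id = ids.get(s);
        if (id != null) return id;
        int next = names.size();
        ids.put(s, next);
        names.add(s);
        return next;
    }

    // Recover the string for display.
    String name(int id) { return names.get(id); }

    public static void main(String[] args) {
        SymbolPool pool = new SymbolPool();
        int a = pool.intern("EURUSD");
        pool.intern("GBPUSD");
        int c = pool.intern("EURUSD");     // same ID as the first call
        System.out.println(a == c);        // true: one int compare, no char scan
    }
}
```

A `where sym=`EURUSD` filter in Q is therefore a scan comparing one int against a column of ints, which is why symbol columns group and filter so cheaply.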
Typical performance on modern hardware:
| Operation | Row count | Time |
|---|---|---|
| Sum of a long column | 1 billion | ~1.5s |
| VWAP by instrument, 1-min bars | 500M trades | ~3s |
| As-of join trades to quotes | 100M trades × 1B quotes | ~8s |
| Time-range filter + aggregate | 10M rows | ~50ms |
These numbers assume data fits in RAM (KDB+ is primarily an in-memory database, with a separate disk-based HDB for historical data). For our risk analytics use case, the intra-day data fit in 64GB RAM; the 5-year historical data lived on the HDB (Historical Database) and queries against it were significantly slower.
Connecting from Java
The Java KDB+ client (kx.c) connects via TCP and allows you to execute Q expressions remotely and receive results:
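A sketch of what such a query looks like from Java. This assumes the KX-published Java client (the `c` class from kx's javakdb) on the classpath and a KDB+ process listening on port 5001; treat it as an illustration under those assumptions, not a drop-in listing:

```java
import kx.c;

class KdbQuery {
    public static void main(String[] args) throws Exception {
        // One TCP connection plus the KDB+ IPC handshake.
        c conn = new c("localhost", 5001);
        try {
            // Execute a Q expression remotely; an unkeyed table comes back
            // as a c.Flip: column names plus one array per column.
            Object res = conn.k("select time, sym, price, qty from trades where sym=`EURUSD");
            c.Flip table = (c.Flip) res;
            String[] cols = table.x;    // column names
            Object[] vectors = table.y; // column vectors, one primitive array each
            for (String col : cols) System.out.println(col);
        } finally {
            conn.close();
        }
    }
}
```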
A c.Flip is Q's table on the Java side: parallel arrays of column vectors, with column names in its x field and column data in its y field. No ORM, no JDBC driver with row-at-a-time fetching. You get back the raw column arrays and work with them directly.
For high-frequency queries, keep the connection open and reuse it. Each new c(...) is a TCP connection with session setup overhead. Connection pools work but add complexity; for a single analytics process a long-lived connection is fine.
When Not to Use KDB+
KDB+ is specialised. It’s wrong for:
- Transactional workloads: no multi-statement transactions, optimistic locking, or row-level updates in the traditional sense
- Arbitrary write patterns: Q tables are effectively append-only in normal operation; in-place updates exist but are expensive
- Teams without Q expertise: the learning curve is steep and the talent pool is small
- Data that doesn’t fit the time-series model: it’s a specialised tool, not a general-purpose database
For everything that doesn’t fit — current positions, reference data, operational state — we used Postgres. KDB+ handled analytics; Postgres handled transactions. The right tool for the right access pattern, a theme that repeats through every data architecture I’ve worked on.