When I moved to a large financial institution, the team I joined managed the market data and trade data storage layer. The engineering problem was deceptively simple to state: store every price tick, every trade execution, and every risk calculation — billions of records per day — and answer analytical queries over them quickly.

The existing system was PostgreSQL. It worked, technically. Queries that needed to run in seconds for trading decisions took minutes. Operational costs for storage were climbing. The database team was spending more time running VACUUM than building features.

Understanding why required understanding what time-series data actually is and why it’s different.

What Makes Time-Series Data Different

A relational database is optimised for transactional workloads:

  • Random access by primary key
  • Updates and deletes (ACID guarantees)
  • Joins across normalised tables
  • Moderate row counts with arbitrary access patterns

Time-series data is the opposite:

  • Monotonically increasing timestamps (always appending, rarely updating)
  • Queries almost always range over time (WHERE timestamp BETWEEN ...)
  • High cardinality in the time dimension (millions of distinct timestamps)
  • Analytical aggregations (OHLC, VWAP, rolling averages) over large ranges

Typical time-series query pattern:

  SELECT symbol,
         date_trunc('minute', timestamp) AS minute,
         avg(price), stddev(price)
  FROM market_data
  WHERE symbol IN ('EUR/USD', 'GBP/USD')
    AND timestamp >= '2016-07-06 09:00:00'
    AND timestamp <  '2016-07-06 17:00:00'
  GROUP BY symbol, date_trunc('minute', timestamp);

Row count scanned: ~5M rows for 8 hours × 2 pairs × ~300K ticks/hour
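For readers more at home in pandas than SQL, the same minute-bucket aggregation can be sketched as follows (a stand-in, not the production query; the tiny frame and its values are made up for illustration):

```python
import pandas as pd

# A hypothetical tick frame with the same shape as the market_data table.
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2016-07-06 09:00:10", "2016-07-06 09:00:40",
        "2016-07-06 09:01:05", "2016-07-06 09:01:30",
    ]),
    "symbol": ["EUR/USD", "EUR/USD", "GBP/USD", "GBP/USD"],
    "price": [1.2844, 1.2846, 1.3102, 1.3104],
})

# Equivalent of GROUP BY symbol, date_trunc('minute', timestamp):
per_minute = (
    ticks
    .set_index("timestamp")
    .groupby(["symbol", pd.Grouper(freq="1min")])["price"]
    .agg(["mean", "std"])
)
```

The result is indexed by (symbol, minute), one row per bucket — the same shape the SQL query produces.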

PostgreSQL walks a B-tree index on timestamp, then fetches each matching row from the heap. B-tree indexes are great for point lookups; for range scans over millions of rows they produce scattered reads of full rows, most of whose columns the query never uses — exactly the access pattern columnar storage eliminates.

Why Columnar Storage Wins Here

Row-based storage (PostgreSQL, MySQL) stores each record together:

Row 1: [timestamp][symbol][bid][ask][venue][size]
Row 2: [timestamp][symbol][bid][ask][venue][size]
...

To compute avg(bid) over 5M rows, you read all 6 columns for all 5M rows, then discard 5 of the 6 columns.

Columnar storage separates each column:

timestamps:  [t1][t2][t3][t4]...
symbols:     [EUR/USD][EUR/USD][GBP/USD]...
bids:        [1.2844][1.2845][1.3102]...
asks:        [1.2845][1.2846][1.3103]...

To compute avg(bid) over 5M rows, you read only the bid column — 1/6 the I/O. Better: the bid column stores doubles contiguously, which compresses well (delta encoding, float compression) and vectorises cleanly in the CPU (SIMD operations on contiguous doubles).
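The delta-encoding point can be shown in a few lines of NumPy (a sketch of the general idea, not any particular database's codec; the timestamps are invented millisecond values):

```python
import numpy as np

# Sorted timestamps delta-encode into small, highly repetitive integers,
# which downstream compressors squeeze far better than the raw values.
ts = np.array([1467792000000, 1467792000250, 1467792000500, 1467792000750])
deltas = np.diff(ts, prepend=0)  # [1467792000000, 250, 250, 250]

# Reconstruction is just a prefix sum:
assert np.array_equal(np.cumsum(deltas), ts)
```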

The performance difference for analytical queries: 10–100× for typical finance workloads.
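A minimal NumPy sketch of the row-versus-column layout difference, assuming six fields per record (the field names and sizes are illustrative):

```python
import numpy as np

n = 1_000_000

# Row-oriented: all six fields of each record stored together.
row_store = np.zeros(n, dtype=[
    ("ts", "i8"), ("sym", "i4"), ("bid", "f8"),
    ("ask", "f8"), ("venue", "i4"), ("size", "i8"),
])
row_store["bid"] = 1.2844

# Column-oriented: the bid column is its own contiguous double array.
bid_column = np.full(n, 1.2844)

# Both compute the same mean, but the columnar scan touches only the bid
# bytes (contiguous doubles the CPU can vectorise); the row-store scan
# must stride past the other five fields of every record.
assert np.isclose(row_store["bid"].mean(), bid_column.mean())
```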

KDB+/Q: The Finance-Specific Solution

KDB+ (from Kx Systems) is a column-oriented database designed from the ground up for financial time-series. It’s used by most major investment banks and hedge funds for market data storage.

The key characteristics:

In-memory column store: KDB+ keeps “hot” data (recent history) in RAM as native column vectors. A query over today’s EUR/USD prices iterates over a contiguous double array — this is as fast as data retrieval can get.

On-disk segmentation: historical data is stored in per-date, per-symbol directories of binary files. Each column is a separate file. Loading historical data into memory is a direct read() into pre-allocated buffers — no parsing, no deserialization.

/data/
  2016.07.06/
    marketdata/
      sym     ← enumerated symbol column (ints indexing a shared symbol list)
      time    ← timestamp column (binary)
      bid     ← double column (binary)
      ask     ← double column (binary)
      sz      ← long column (binary)
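The “direct read, no parsing” property is easy to demonstrate with raw binary column files (an illustration of the idea only — real kdb+ column files carry a small type header, and the path here is a throwaway temp directory):

```python
import os
import tempfile
import numpy as np

# Write a bid column as raw doubles, then read it straight back into
# a typed buffer -- no text parsing, no per-row deserialisation.
bids = np.array([1.2844, 1.2845, 1.2846])
path = os.path.join(tempfile.mkdtemp(), "bid")
bids.tofile(path)

loaded = np.fromfile(path, dtype=np.float64)
assert np.array_equal(loaded, bids)
```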

Q language: a vector-oriented language designed for operations on entire columns:

// Load today's EUR/USD data (symbols containing "/" need the cast form):
eurusd: select from marketdata where date=2016.07.06, sym=`$"EUR/USD"

// Compute 1-minute OHLC:
ohlc: select o: first bid, h: max bid, l: min bid, c: last bid
      by 1 xbar time.minute
      from eurusd

// VWAP (volume-weighted average price):
vwap: select vwap: sz wavg bid from eurusd

The wavg function (weighted average) operates on entire vectors — it runs as a single vectorised primitive over contiguous memory, not as a loop in the Q interpreter.
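For readers without a q interpreter, the same whole-vector weighted average can be sketched in NumPy (a stand-in for `sz wavg bid`, with invented sample values):

```python
import numpy as np

# Vectorised VWAP, mirroring q's `sz wavg bid`:
bid = np.array([1.2844, 1.2845, 1.2846])
sz  = np.array([100.0, 300.0, 600.0])

# np.average with weights is (sz * bid).sum() / sz.sum() in one call.
vwap = np.average(bid, weights=sz)
```

As in q, the whole computation is a handful of vector operations over contiguous arrays — there is no per-tick loop in interpreted code.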

Query performance:

Query type                          PostgreSQL    KDB+
──────────────────────────────────────────────────────
1-min OHLC, 1 day, 1 symbol         45s           0.3s
VWAP, 1 day, all symbols            18m           2.1s
Tick data load, 1 day               n/a           0.8s
Cross-symbol correlation, 1 week    n/a           8.4s

The “n/a” entries: PostgreSQL couldn’t finish these queries within a reasonable timeout.

The Operational Model

KDB+ requires a different operational model than PostgreSQL:

Memory-first: the database assumes recent data lives in RAM. The process typically has 100GB–1TB of memory. Q’s garbage collector is manual — you need to understand Q memory management to avoid crashes.

Single-threaded: by default, a KDB+ process is single-threaded (though parallel and distributed workloads are supported). Complex queries block the process. For high availability, you run multiple read processes behind a load balancer.

No ACID transactions: KDB+ is append-only. Updates to historical data require deleting and re-ingesting. Suitable for market data (immutable); problematic for trade data (subject to corrections).

Licensing: KDB+ is commercial and expensive. This was acceptable at the bank but impractical at smaller organisations.

What the Post-KDB+ World Looks Like

By 2016, the open-source ecosystem had caught up for many use cases:

System                    Best for                          Not for
─────────────────────────────────────────────────────────────────────────────────────
InfluxDB                  Metrics, monitoring, IoT          Large joins, complex analytics
TimescaleDB               SQL familiarity, moderate scale   Extreme performance needs
Apache Parquet + Spark    Batch analytics, large scale      Low-latency queries
ClickHouse                Fast OLAP, column scans           Time-series ordering guarantees
Arctic (Man Group)        Python/Pandas, quant research     Production trading

At the institution, KDB+ remained for the latency-sensitive market data path. ClickHouse was introduced for regulatory reporting and analytical workloads where the query latency requirement was seconds rather than milliseconds. This split — KDB+ for real-time, ClickHouse for analytics — turned out to be a clean separation that served both needs well.

The core lesson: match the data model to the access pattern. Time-series data with analytical query requirements is a bad fit for row-oriented transactional databases. The tooling has matured enough that there’s no longer a good reason to use a general-purpose relational database as a time-series store.