The analytics team’s query: “Give me total notional, average spread, and fill rate for every instrument over the last 90 days, broken down by hour.” On our Postgres trade history table with ~2 billion rows: 4 hours, 23 minutes. After the columnar rewrite: 8 seconds.

This post is about why, not how to install Parquet.

Why Row Storage Is Wrong for Analytics

A row-oriented store (Postgres, MySQL, most OLTP databases) lays data out like this:

Row 1: [trade_id=1001, instrument="EURUSD", price=1.0842, qty=1000000, side=BUY,  ts=09:31:02, ...]
Row 2: [trade_id=1002, instrument="GBPUSD", price=1.2731, qty=500000,  side=SELL, ts=09:31:03, ...]
Row 3: [trade_id=1003, instrument="EURUSD", price=1.0843, qty=2000000, side=BUY,  ts=09:31:04, ...]

Each row is contiguous in storage. To answer “what is the total notional (price × qty) for all EURUSD trades?” the database must:

  1. Read every row (to check the instrument filter)
  2. For each matching row, extract price and qty
  3. Compute price × qty

If the table has 50 columns and you need 3 of them, you’re reading the other 47 from disk for nothing. At 2 billion rows with ~200 bytes/row, that’s ~400GB of I/O for a query that actually needs ~24 bytes/row.

A column-oriented store lays data out like this:

instrument column: ["EURUSD", "GBPUSD", "EURUSD", "USDJPY", ...]  ← all instrument values together
price column:      [1.0842, 1.2731, 1.0843, 110.23, ...]
qty column:        [1000000, 500000, 2000000, 3000000, ...]
ts column:         [09:31:02, 09:31:03, 09:31:04, 09:31:05, ...]

For the same query:

  1. Read only instrument, price, qty columns — skip everything else
  2. Filter the instrument column for “EURUSD” — produces a bitmap of matching row indices
  3. Apply the bitmap to price and qty columns — read only the matching values

Actual I/O: ~24 bytes per matching row. For 10% selectivity on 2 billion rows, that’s ~4.8GB instead of ~400GB — nearly two orders of magnitude less, before compression shrinks it further. That’s most of the hours-to-seconds gap.

The Compression Multiplier

Columnar storage compresses dramatically better than row storage because similar values are adjacent.

instrument values in row storage (interleaved):  "EURUSD", "GBPUSD", "EURUSD", "USDJPY"  ← varied
instrument column (columnar, sorted):            "EURUSD", "EURUSD", "EURUSD", ..., "GBPUSD", "GBPUSD"  ← highly repetitive

A sorted instrument column with ~200 distinct values across 2 billion rows compresses to almost nothing. Dictionary encoding stores the 200 strings once and replaces each value with a 1-byte integer index, taking ~32GB of raw strings down to ~2GB of indices; because the column is sorted, run-length encoding then collapses those indices to roughly 2MB — each instrument is one long run.

Compression ratios we saw in practice:

Column type               Uncompressed    Parquet (Snappy)    Ratio
Instrument (200 values)   16 bytes/row    0.001 bytes/row     16,000×
Price (float64)           8 bytes/row     4–6 bytes/row       1.5–2×
Timestamp (int64)         8 bytes/row     2–3 bytes/row       3–4×
Side (BUY/SELL)           8 bytes/row     0.125 bytes/row     64×
Free text (description)   variable        0.4–0.8× of raw     Not great

These ratios compound: the 90-day analytics dataset went from ~400GB to under 20GB. That fits in RAM on a modern analytics server.

Parquet vs Arrow: Two Different Things

These are often mentioned together but serve different purposes:

Aspect          Apache Parquet                  Apache Arrow
Primary use     On-disk columnar storage        In-memory columnar format
Optimised for   Sequential read from disk       Random access, zero-copy
Encoding        Compressed (Snappy, Zstd)       Uncompressed (direct memory)
Interop         File format standard            IPC protocol standard
Use together?   Yes: read Parquet into Arrow    Yes: process in Arrow

The typical pattern in an analytics pipeline:

Parquet files (S3/HDFS)
       ↓  read + decompress
Arrow RecordBatch (in memory)
       ↓  vectorised compute (Arrow compute kernels, DuckDB, etc.)
Result
       ↓  optional: write back as Parquet

We used this pattern for end-of-day risk reporting: the prior day’s trades were Parquet files in S3; the risk calculation loaded them into Arrow in-memory format and ran vectorised aggregations using Arrow’s compute API. Total time for 90-day rolling risk: reduced from 4 hours to 9 minutes.

Predicate Pushdown: The Hidden Win

Parquet files store metadata (min/max values, null counts) per row group (default 128MB chunks). Query engines that understand Parquet use this to skip entire row groups that can’t possibly satisfy the filter:

Query: SELECT * FROM trades WHERE ts BETWEEN '09:30:00' AND '09:31:00'

Row group 1: ts range [08:00:00, 08:59:59] → SKIP (entirely before filter)
Row group 2: ts range [09:00:00, 09:29:59] → SKIP
Row group 3: ts range [09:30:00, 09:34:22] → READ (overlaps filter range)
Row group 4: ts range [09:35:00, 10:00:00] → SKIP

For a time-filtered query on sorted-by-time Parquet files, 95% of the data is skipped without reading it. This requires writing Parquet files sorted by the most common filter column — for trading data, timestamp is the natural choice.

What We Actually Deployed

Architecture that replaced the Postgres analytics queries:

Trade events (Kafka)
      ↓  Flink job (hourly rollup)
Parquet files in S3  (partitioned by YYYY/MM/DD/HH)
      ↓  Athena (ad-hoc SQL)  /  custom Arrow pipeline (scheduled reports)
Analytics dashboard / risk reports

Postgres remained for OLTP: current positions, live order state, reference data. Anything requiring transactional consistency or row-level updates stayed in Postgres. The Parquet layer was append-only and immutable by design — no updates, only new files.

This separation (OLTP in Postgres, OLAP in Parquet/Arrow) is the correct architecture for this workload. The mistake is trying to use one system for both access patterns.