The analytics team’s query: “Give me total notional, average spread, and fill rate for every instrument over the last 90 days, broken down by hour.” On our Postgres trade history table with ~2 billion rows: 4 hours, 23 minutes. After the columnar rewrite: 8 seconds.
This post is about why, not how to install Parquet.
## Why Row Storage Is Wrong for Analytics
A row-oriented store (Postgres, MySQL, most OLTP databases) lays data out like this:
```
Row 1: [trade_id=1001, instrument="EURUSD", price=1.0842, qty=1000000, side=BUY, ts=09:31:02, ...]
Row 2: [trade_id=1002, instrument="GBPUSD", price=1.2731, qty=500000, side=SELL, ts=09:31:03, ...]
Row 3: [trade_id=1003, instrument="EURUSD", price=1.0843, qty=2000000, side=BUY, ts=09:31:04, ...]
```
Each row is contiguous in storage. To answer “what is the total notional (price × qty) for all EURUSD trades?” the database must:
- Read every row (to check the `instrument` filter)
- For each matching row, extract `price` and `qty`
- Compute `price × qty`
If the table has 50 columns and you need 3 of them, you’re reading 47 unnecessary columns from disk. At 2 billion rows with ~200 bytes/row, that’s ~400GB of I/O for a query touching 12 bytes/row of actual data.
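The row-store access pattern, reduced to a toy sketch (the rows and field names below are illustrative, not the real schema):

```python
# Toy illustration of a row-store scan: every field of every row is
# materialised, even though the query only needs three of them.
rows = [
    {"trade_id": 1001, "instrument": "EURUSD", "price": 1.0842,
     "qty": 1_000_000, "side": "BUY"},
    {"trade_id": 1002, "instrument": "GBPUSD", "price": 1.2731,
     "qty": 500_000, "side": "SELL"},
    {"trade_id": 1003, "instrument": "EURUSD", "price": 1.0843,
     "qty": 2_000_000, "side": "BUY"},
]

total_notional = 0.0
for row in rows:                       # reads the whole row from "disk"
    if row["instrument"] == "EURUSD":  # filter check needs only one field
        total_notional += row["price"] * row["qty"]
```

Every iteration pays for all 50 columns to inspect one.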
A column-oriented store lays data out like this:
```
instrument column: ["EURUSD", "GBPUSD", "EURUSD", "USDJPY", ...] ← all instrument values together
price column:      [1.0842, 1.2731, 1.0843, 110.23, ...]
qty column:        [1000000, 500000, 2000000, 3000000, ...]
ts column:         [09:31:02, 09:31:03, 09:31:04, 09:31:05, ...]
```
For the same query:
- Read only the `instrument`, `price`, and `qty` columns — skip everything else
- Filter the `instrument` column for "EURUSD" — produces a bitmap of matching row indices
- Apply the bitmap to the `price` and `qty` columns — read only the matching values
Actual I/O: ~24 bytes × matching rows. For 10% selectivity on 2 billion rows: ~4.8GB instead of 400GB, nearly two orders of magnitude less before compression even enters the picture. That is most of the gap between 4 hours and seconds.
## The Compression Multiplier
Columnar storage compresses dramatically better than row storage because similar values are adjacent.
```
instrument column (row storage adjacent): "EURUSD", "GBPUSD", "EURUSD", "USDJPY" ← varied
instrument column (columnar):             "EURUSD", "EURUSD", "EURUSD", ..., "GBPUSD", "GBPUSD" ← sorted, highly repetitive
```
A sorted instrument column with ~200 distinct values across 2 billion rows compresses to near zero with dictionary encoding: store the ~200 strings once and a small integer code per row. 2 billion rows of 16-byte instrument strings → ~32GB raw → ~2GB as 1-byte codes → ~2MB once run-length encoding collapses the sorted runs.
Compression ratios we saw in practice:
| Column type | Uncompressed | Parquet (Snappy) | Ratio |
|---|---|---|---|
| Instrument (200 values) | 16 bytes/row | 0.001 bytes/row | 16,000× |
| Price (float64) | 8 bytes/row | 4–6 bytes/row | 1.5–2× |
| Timestamp (int64) | 8 bytes/row | 2–3 bytes/row | 3–4× |
| Side (BUY/SELL) | 8 bytes/row | 0.125 bytes/row | 64× |
| Free text (description) | variable | 0.4–0.8× of raw size | 1.25–2.5× (not great) |
The instrument compression alone moved the 90-day analytics dataset from ~400GB to under 20GB. That fits in RAM on a modern analytics server.
## Parquet vs Arrow: Two Different Things
These are often mentioned together but serve different purposes:
| Aspect | Apache Parquet | Apache Arrow |
|---|---|---|
| Primary use | On-disk columnar storage | In-memory columnar format |
| Optimised for | Sequential read from disk | Random access, zero-copy |
| Encoding | Compressed (Snappy, Zstd) | Uncompressed (direct memory) |
| Interop | File format standard | IPC protocol standard |
| Use together? | Yes: read Parquet from storage into Arrow | Yes: compute in Arrow, write results back as Parquet |
The typical pattern in an analytics pipeline:
```
Parquet files (S3/HDFS)
  ↓ read + decompress
Arrow RecordBatch (in memory)
  ↓ vectorised compute (Arrow compute kernels, DuckDB, etc.)
Result
  ↓ optional: write back as Parquet
```
We used this pattern for end-of-day risk reporting: the prior day’s trades were Parquet files in S3; the risk calculation loaded them into Arrow in-memory format and ran vectorised aggregations using Arrow’s compute API. Total time for 90-day rolling risk: reduced from 4 hours to 9 minutes.
## Predicate Pushdown: The Hidden Win
Parquet files store metadata (min/max values, null counts) per row group (default 128MB chunks). Query engines that understand Parquet use this to skip entire row groups that can’t possibly satisfy the filter:
```
Query: SELECT * FROM trades WHERE ts BETWEEN '09:30:00' AND '09:31:00'

Row group 1: ts range [08:00:00, 08:59:59] → SKIP (entirely before filter)
Row group 2: ts range [09:00:00, 09:29:59] → SKIP
Row group 3: ts range [09:30:00, 09:34:22] → READ (overlaps filter range)
Row group 4: ts range [09:35:00, 10:00:00] → SKIP
```
For a time-filtered query on sorted-by-time Parquet files, 95% of the data is skipped without reading it. This requires writing Parquet files sorted by the most common filter column — for trading data, timestamp is the natural choice.
## What We Actually Deployed
Architecture that replaced the Postgres analytics queries:
```
Trade events (Kafka)
  ↓ Flink job (hourly rollup)
Parquet files in S3 (partitioned by YYYY/MM/DD/HH)
  ↓ Athena (ad-hoc SQL) / custom Arrow pipeline (scheduled reports)
Analytics dashboard / risk reports
```
Postgres remained for OLTP: current positions, live order state, reference data. Anything requiring transactional consistency or row-level updates stayed in Postgres. The Parquet layer was append-only and immutable by design — no updates, only new files.
This separation (OLTP in Postgres, OLAP in Parquet/Arrow) is the correct architecture for this workload. The mistake is trying to use one system for both access patterns.