The fintech startup had three different time-series storage problems at the same time. After evaluating the options available in 2020, we ended up running two different systems. Here’s the decision framework and why the landscape fragmented the way it did.

The Three Access Patterns

Time-series workloads look similar at the surface but split into distinct access patterns:

Monitoring telemetry: millions of data points per second, high write throughput, queries almost always target recent data (last 5 minutes, last hour), high cardinality (one series per pod per metric), retention in days to weeks. The system needs to be fast at ingestion and at “what is the current value / what was it 5 minutes ago?” queries. Aggregations are usually simple (rate, average, percentile over time windows).

Business metrics: lower write throughput, longer retention (months to years), complex queries (arbitrary GROUP BY, joins to other data, ad-hoc aggregations). The system needs to be fast at analytical queries and survive schema evolution as the business changes what it wants to measure.

Event streams: high write throughput, time-ordered append-only, queries are almost always “all events in time range” or “events matching predicate in time range.” Analytical roll-ups are secondary. Retention can be very long.

These are different enough that a system optimised for one performs poorly for another.

The Candidates in 2020

InfluxDB (OSS 1.x / 2.x): purpose-built for time series. Excellent write throughput, good compression, dedicated query languages (InfluxQL / Flux). The known weakness is cardinality: high-cardinality tag combinations (e.g., pod_id × metric_name × endpoint) cause memory pressure and query degradation. Retention policies are built in. Strong ecosystem (Telegraf, Grafana).

TimescaleDB: PostgreSQL extension. You get SQL, foreign keys, joins to relational data, mature backup/recovery. The time-series-specific features (hypertables, continuous aggregates, compression) are solid. The cost: you’re still running PostgreSQL, with PostgreSQL’s operational model. Not the best write throughput for pure time series.

ClickHouse: columnar OLAP database. Exceptional read performance for analytical queries. Write throughput is high if you batch (individual row inserts are inefficient). Not purpose-built for time series but handles it well for analytical workloads. SQL dialect (with some quirks). Compression ratios are excellent. Operational complexity is moderate; no managed offering was mature in 2020.

Prometheus: monitoring-specific, pull-based, built around the recording rule / alerting model. Not designed for long-term storage or high-cardinality raw queries. Correct for what it does (infrastructure monitoring, alerting); wrong for anything else.

InfluxDB Cloud / managed options: the managed TSDB landscape was much younger in 2020 than it is now. Timescale Cloud and InfluxDB Cloud existed; neither was mature. We ran everything self-hosted.

Our Three Problems and What We Chose

Problem 1: Infrastructure monitoring. Pod metrics, service latency, error rates. Millions of data points per day. Queries almost always recent. Needs alerting integration.

Decision: Prometheus. The right tool for this job. We ran Prometheus with a 30-day retention, used Thanos for long-term storage when we needed it. Grafana on top. The cardinality was manageable because infrastructure metrics have bounded label sets.

Problem 2: Market data storage. We ingested normalised price updates from multiple venues — roughly 100k–200k data points per second during market hours. Needed to retain months of data for model training and backtesting. Queries were a mix: “all EUR/USD prices in this 5-minute window” (range queries) and “average spread by venue by hour over the last month” (analytical).
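That ingest rate translates into concrete storage numbers. A back-of-envelope sizing sketch, assuming round hypothetical figures (~150k updates/sec sustained over an 8-hour session, ~50 bytes per update raw and ~10 bytes compressed, in line with the ranges quoted in this post):

```python
# Back-of-envelope sizing for the market data workload.
# All constants are illustrative assumptions, not measurements.
UPDATES_PER_SEC = 150_000      # midpoint of the 100k-200k/sec range
SESSION_SECONDS = 8 * 3600     # one trading session
RAW_BYTES = 50                 # uncompressed bytes per price update
COMPRESSED_BYTES = 10          # after columnar compression

points_per_day = UPDATES_PER_SEC * SESSION_SECONDS
raw_gb_per_day = points_per_day * RAW_BYTES / 1e9
compressed_gb_per_day = points_per_day * COMPRESSED_BYTES / 1e9

print(f"{points_per_day:,} points/day")                   # 4,320,000,000
print(f"{raw_gb_per_day:.0f} GB/day raw")                 # 216 GB
print(f"{compressed_gb_per_day:.1f} GB/day compressed")   # 43.2 GB
```

At roughly 40-45 GB/day compressed, retaining months of data lands in the low terabytes — comfortably self-hostable, which is part of why a columnar store was viable here.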

Decision: ClickHouse. The write throughput with batching (500 rows, 50ms window) was easily adequate. The columnar storage compressed market data exceptionally well — each price update was roughly 50 bytes uncompressed, 8–12 bytes after compression. Analytical queries over months of data ran in seconds. We partitioned by day and used an ORDER BY (instrument, timestamp) primary key for range query performance.
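The batching logic is the part worth getting right, since ClickHouse strongly prefers few large inserts over many small ones. A minimal sketch of the size-or-time flush pattern, using the 500-row / 50 ms thresholds above; the `send_batch` callback (which would perform the actual ClickHouse insert) is a hypothetical name:

```python
import time

class BatchBuffer:
    """Accumulate rows and flush when either a row-count or an age
    threshold is reached (the size-or-time pattern)."""

    def __init__(self, send_batch, max_rows=500, max_age_ms=50):
        self.send_batch = send_batch   # callable taking a list of rows
        self.max_rows = max_rows
        self.max_age = max_age_ms / 1000.0
        self.rows = []
        self.oldest = None             # monotonic time of first buffered row

    def add(self, row):
        if not self.rows:
            self.oldest = time.monotonic()
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.oldest >= self.max_age):
            self.flush()

    def flush(self):
        if self.rows:
            self.send_batch(self.rows)
            self.rows = []
            self.oldest = None
```

A production version would also flush on a background timer, so a quiet stream can't strand rows past the age limit between `add` calls.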

The operational investment: ClickHouse required more care than a managed database. Shard planning, backup configuration, replication setup. Worth it for the query performance.

Problem 3: Business metrics. Monthly active users, transaction counts by product type, revenue metrics. Moderate volume (thousands of data points per day). Arbitrary ad-hoc queries from analysts. Joins to customer and product tables.

Decision: TimescaleDB. The business metrics needed to join to relational data in PostgreSQL — customer IDs, product definitions, region mappings. Running a separate time-series database for this would have required ETL to bring the relational data over. TimescaleDB let us keep both in PostgreSQL with a single query. The write volume was trivially within PostgreSQL’s capabilities.

The Failure Modes We Encountered

ClickHouse merge pressure: ClickHouse uses an LSM-style merge tree. At very high insert rates, inserts create parts faster than background merges can coalesce them. The symptom: “too many parts” errors, query degradation. Fix: increase batch size, reduce insert frequency, tune the merge settings. We’d been inserting too frequently with small batches.

Prometheus cardinality explosion: a developer added a label with a high-cardinality value (request ID) to a metric. Prometheus memory usage spiked from 2GB to 18GB in hours. Fix: remove the label, restart (losing some recent metrics history), add a CI check that validates metric labels against a cardinality budget.
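A CI check like the one just described can stay simple. A sketch, assuming teams declare each metric's labels in a reviewed config file; the denylist and label budget below are illustrative values, not Prometheus defaults:

```python
# CI-style check: reject metric declarations that use known
# high-cardinality fields as labels, or exceed a label-count budget.
# Denylist and budget are illustrative assumptions.
DENYLIST = {"request_id", "trace_id", "user_id", "session_id"}
MAX_LABELS_PER_METRIC = 6

def check_metric(name, labels):
    """Return a list of violation messages for one declared metric."""
    violations = []
    bad = DENYLIST & set(labels)
    if bad:
        violations.append(f"{name}: high-cardinality labels {sorted(bad)}")
    if len(labels) > MAX_LABELS_PER_METRIC:
        violations.append(
            f"{name}: {len(labels)} labels exceeds budget of "
            f"{MAX_LABELS_PER_METRIC}")
    return violations

def check_all(metrics):
    """metrics: dict mapping metric name -> list of label names."""
    problems = []
    for name, labels in metrics.items():
        problems.extend(check_metric(name, labels))
    return problems
```

Running `check_all({"http_requests_total": ["method", "status", "request_id"]})` flags the `request_id` label before it ever reaches Prometheus, which is much cheaper than discovering it via a memory spike.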

TimescaleDB chunk retention: TimescaleDB chunks (the time-partitioned sub-tables) accumulate. We hadn’t configured a retention policy, and after six months the chunk count was large enough to affect query planning. Fix: SELECT add_retention_policy('metrics_table', INTERVAL '6 months');

What I’d Tell Someone Starting in 2020

  • Don’t use your plain relational database for time series at scale. PostgreSQL with a timestamp index works up to a few million rows. Past that, use TimescaleDB (same database, built for the workload) or a purpose-built TSDB.
  • Prometheus is for monitoring, not storage. If you need to query raw metrics from 6 months ago, you need something else (Thanos, Cortex, or a separate store).
  • ClickHouse is exceptional for analytical time series if you can accept operational complexity. The query speed is hard to match.
  • InfluxDB works if your cardinality is bounded. If you’re building a SaaS product where each customer has their own series, InfluxDB’s cardinality limits will bite you.
  • Evaluate on your actual access patterns. The “benchmark” numbers on vendor websites are optimised for their strongest case. Run your actual query patterns against each candidate.

By 2020 the right answer was clearly “it depends on which of the three problems you have.” The mistake is picking one system for all three and discovering later that it’s wrong for two of them.