The fintech startup processed market events at sustained rates of 50,000–200,000 per second through a normalisation and enrichment pipeline. Go channels and goroutines are the natural tools for this job, but the naive implementation falls apart at scale. Here’s what works.

The Problem with Naive Pipelines

The textbook Go pipeline:

func pipeline(input <-chan Event) <-chan EnrichedEvent {
    out := make(chan EnrichedEvent)
    go func() {
        for event := range input {
            out <- enrich(event)
        }
        close(out)
    }()
    return out
}

Problems at high throughput:

  1. Channel overhead: each send/receive involves a goroutine scheduling point. At 100k events/sec, this is 200k scheduler operations per second on this stage alone.
  2. No batching: each event is processed individually. Many operations (database lookups, Kafka writes) are much more efficient in batches.
  3. No backpressure visibility: the pipeline blocks silently. You can’t observe how full the queues are.
  4. GC pressure: if EnrichedEvent is a struct with pointers, 100k/sec is 100k allocations/sec for this stage.

Batching at Stage Boundaries

Process events in batches where possible:

func batchEnrich(input <-chan []Event) <-chan []EnrichedEvent {
    out := make(chan []EnrichedEvent, 4)
    go func() {
        defer close(out)
        for batch := range input {
            enriched := make([]EnrichedEvent, len(batch))
            for i, e := range batch {
                enriched[i] = enrich(e)
            }
            out <- enriched
        }
    }()
    return out
}

But batching requires assembling the batch somewhere. A collector stage:

func collect(input <-chan Event, batchSize int, maxWait time.Duration) <-chan []Event {
    out := make(chan []Event, 4)
    go func() {
        defer close(out)
        batch := make([]Event, 0, batchSize)
        timer := time.NewTimer(maxWait)
        defer timer.Stop()

        flush := func() {
            if len(batch) > 0 {
                out <- batch
                batch = make([]Event, 0, batchSize) // fresh slice: downstream now owns the old one
            }
        }

        for {
            select {
            case e, ok := <-input:
                if !ok {
                    flush() // input closed: emit whatever is pending
                    return
                }
                batch = append(batch, e)
                if len(batch) >= batchSize {
                    flush()
                    // Restart the wait window. On Go 1.23+ a bare Reset is
                    // race-free; on older versions a stale timer fire at
                    // worst triggers one early flush, which is harmless.
                    // (The classic Stop-then-drain pattern can deadlock on
                    // Go 1.23+, so we avoid it.)
                    timer.Reset(maxWait)
                }
            case <-timer.C:
                flush()
                timer.Reset(maxWait)
            }
        }
    }()
    return out
}

This assembles events into batches of up to batchSize, flushing earlier if maxWait elapses. For database writes, we used batchSize=500, maxWait=50ms — this turned 50k individual writes/sec into 100 batch writes/sec of 500 rows each.

Worker Pools for CPU-Bound Stages

When a stage is CPU-bound, parallelise it with a worker pool:

func parallelEnrich(input <-chan Event, workers int) <-chan EnrichedEvent {
    out := make(chan EnrichedEvent, workers*16)
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for event := range input {
                out <- enrich(event)
            }
        }()
    }

    go func() {
        wg.Wait()
        close(out)
    }()

    return out
}

Note: this loses ordering. If downstream stages need events in arrival order, you need ordered fan-out (fan out with indexed workers, fan in with a reorder buffer). This adds complexity and latency. We usually accepted out-of-order processing for enrichment, with ordering restored at write time using event timestamps.

Backpressure: Making It Observable

Silent blocking is the enemy of understanding. Add metrics to every channel:

type MonitoredChan[T any] struct {
    ch      chan T
    name    string
    metrics *prometheus.GaugeVec
}

func (m *MonitoredChan[T]) Send(v T) {
    m.metrics.WithLabelValues(m.name, "pending").Set(float64(len(m.ch)))
    m.ch <- v
}

We emitted pipeline_queue_depth{stage="enrichment"} every 5 seconds. When the enrichment queue sat consistently at capacity, that stage was the bottleneck; when it sat near zero, the stage had spare capacity.

Alternatively, use a non-blocking send with explicit overflow handling:

select {
case out <- enriched:
    // sent
default:
    // queue full — apply backpressure or drop with metric
    droppedCounter.Inc()
}

Dropping with a metric is sometimes the right choice for event pipelines with soft real-time requirements. If an enrichment event is 500ms stale and the queue is full, dropping it (and logging) is better than blocking the pipeline and making everything stale.

When to Abandon Channels

At very high throughput, the channels themselves become the bottleneck. The overhead is real: a channel send that triggers a goroutine context switch costs ~150-300ns under contention. At 500k events/sec, that’s 75-150ms of overhead per second — significant.

For the hottest path (raw event ingestion), we moved to a ring buffer:

// Simplified ring buffer — single producer, single consumer
type RingBuffer struct {
    data     []Event
    mask     uint64
    writePos uint64
    readPos  uint64
}

func NewRingBuffer(size int) *RingBuffer {
    // size must be a power of 2 so pos&mask wraps correctly
    if size <= 0 || size&(size-1) != 0 {
        panic("ring buffer size must be a power of two")
    }
    return &RingBuffer{
        data: make([]Event, size),
        mask: uint64(size - 1),
    }
}

func (r *RingBuffer) TryWrite(e Event) bool {
    pos := r.writePos // plain read is safe: only the producer writes writePos
    // readPos is written by the consumer, so it must be loaded atomically.
    if pos-atomic.LoadUint64(&r.readPos) >= uint64(len(r.data)) {
        return false // full
    }
    r.data[pos&r.mask] = e
    atomic.StoreUint64(&r.writePos, pos+1) // publish only after the slot is written
    return true
}

func (r *RingBuffer) TryRead() (Event, bool) {
    pos := r.readPos
    if atomic.LoadUint64(&r.writePos) == pos {
        return Event{}, false  // empty
    }
    e := r.data[pos&r.mask]
    atomic.StoreUint64(&r.readPos, pos+1)
    return e, true
}

This is the Disruptor pattern applied to Go. The consumer spins (or parks briefly) waiting for new events. No locks, no goroutine scheduling. Throughput: ~2-5M events/sec on a single core vs ~500k/sec through a channel.

The cost: spinning wastes CPU when the buffer is empty. We used a hybrid approach: spin for 1µs, then park with runtime.Gosched(), then sleep if still empty. This gave high throughput under load with acceptable idle CPU usage.

Putting It Together

Our actual pipeline for the market event flow:

Kafka Consumer (batch receive)
  → decode + validate (worker pool, 4 workers)
  → ring buffer
  → normalise (single goroutine, CPU-bound, optimised tight loop)
  → collect into batches (500 events, 50ms max wait)
  → enrich from reference data (batch lookup, worker pool)
  → write to ClickHouse (batch insert)
  → publish to downstream Kafka topics (batch produce)

Measured throughput: 180k events/sec sustained, 220k peak, on two vCPUs. Latency from Kafka consume to ClickHouse write: median 40ms, p99 120ms (dominated by the 50ms batch window).

The design decisions that mattered most:

  1. Batching at write boundaries: eliminated 90% of database round trips
  2. Ring buffer for the hot normalisation path: eliminated scheduling overhead on the CPU-bound stage
  3. Queue depth metrics: identified the bottleneck stage immediately (initially enrichment; after a cache was added, ClickHouse writes)
  4. Explicit backpressure: Kafka consumer paused when the ring buffer was full, letting Kafka absorb the backpressure rather than OOM-ing the service

Channels are fine for moderate throughput (< 50k events/sec). Above that, the design decisions compound: batching, worker pool sizing, buffer depths, and knowing which stages need lock-free implementations. Profile before optimising. The bottleneck is almost never where you expect.