The case for Akka actors in a trading system sounds compelling: isolated mutable state (no shared memory, no locks), message-driven concurrency, built-in supervision hierarchies for fault tolerance, and location transparency for distributed deployments.
We used Akka for the order lifecycle workflow layer — the component that orchestrated the state machine from order received to fill confirmed. Here’s what we learned.
Why Actors for Order Workflow
An order in an FX trading system has a lifecycle:
RECEIVED → VALIDATED → RISK_CHECKED → SENT_TO_VENUE
→ (ACKNOWLEDGED | REJECTED | TIMED_OUT)
→ FILLED | PARTIALLY_FILLED
→ SETTLED
Each transition requires an action (send to risk service, send to venue, update position) and may trigger side effects (send confirmation, update blotter, trigger hedge logic). The state machine has a dozen valid transitions, and handling each one correctly under concurrent events (fills arriving while a cancel is in flight) was error-prone.
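The lifecycle above can be written down as an explicit transition table, which turns "is this event valid in this state?" into a checkable condition rather than a latent bug. A minimal sketch in plain Java, one plausible reading of the diagram (illustrative names, not our production code):

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Set;

// Order lifecycle states from the diagram above.
enum OrderState {
    RECEIVED, VALIDATED, RISK_CHECKED, SENT_TO_VENUE,
    ACKNOWLEDGED, REJECTED, TIMED_OUT,
    FILLED, PARTIALLY_FILLED, SETTLED
}

final class OrderLifecycle {
    // Explicit transition table: any event not in this map is invalid.
    private static final Map<OrderState, Set<OrderState>> VALID = new EnumMap<>(Map.of(
        OrderState.RECEIVED,         Set.of(OrderState.VALIDATED),
        OrderState.VALIDATED,        Set.of(OrderState.RISK_CHECKED),
        OrderState.RISK_CHECKED,     Set.of(OrderState.SENT_TO_VENUE),
        OrderState.SENT_TO_VENUE,    Set.of(OrderState.ACKNOWLEDGED,
                                            OrderState.REJECTED,
                                            OrderState.TIMED_OUT),
        OrderState.ACKNOWLEDGED,     Set.of(OrderState.FILLED, OrderState.PARTIALLY_FILLED),
        OrderState.PARTIALLY_FILLED, Set.of(OrderState.FILLED, OrderState.PARTIALLY_FILLED),
        OrderState.FILLED,           Set.of(OrderState.SETTLED)));

    static boolean canTransition(OrderState from, OrderState to) {
        return VALID.getOrDefault(from, Set.of()).contains(to);
    }
}
```

Rejecting an invalid transition at a single choke point like this is what the per-order actor's message handler enforced in practice.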
The actor model seemed well-suited: one actor per order, its state is isolated, messages drive transitions, supervision restarts failed actors cleanly.
What Worked Well
State isolation: each order actor owned its state. No locks, no concurrent modification. The order’s state was whatever the actor’s internal variables said it was, updated only by message processing. The class of bugs where two threads concurrently modified order state disappeared.
Supervision hierarchies: when an actor failed (an exception during message processing), the supervisor could restart it, stop it, or escalate. For transient failures (a connectivity blip to the risk service), restart-with-backoff worked cleanly: the order actor restarted, re-processed the failing message, and continued. (Note that a plain Akka restart discards the message being processed, so re-delivery has to be arranged explicitly.)
Message tracing: since every state change was driven by a message, logging the message stream gave a complete audit trail. Order ORD-123: RECEIVED(at 14:32:11.441), VALIDATED(14:32:11.442), SENT_TO_VENUE(14:32:11.443), FILLED(14:32:11.891). This replaced a sprawling log search with a focused, ordered event trail.
What Was Painful
Mailbox overflow under load: actors process one message at a time. During high market volatility — when fill confirmations arrived faster than the order actors could process them — the actor mailboxes filled up. A bounded Akka mailbox's default behaviour on overflow is to route messages to dead letters, which from the receiving actor's point of view is a silent drop. (The default unbounded mailbox avoids drops but trades them for unbounded memory growth.) We discovered this in production when fills were silently dropped.
Fix: bounded mailbox with explicit drop handling and alerting, plus capacity planning to ensure the actor thread pool could handle peak message rates.
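The essence of that fix is a bounded queue whose overflow path is explicit and loud. A minimal plain-Java illustration (AlertingMailbox is a hypothetical name, not Akka's API; a real deployment would emit a metric rather than print):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Bounded mailbox that counts and reports drops instead of losing them silently.
final class AlertingMailbox<M> {
    private final BlockingQueue<M> queue;
    private final AtomicLong dropped = new AtomicLong();

    AlertingMailbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking enqueue; on overflow, record the drop and raise an alert.
    boolean offer(M message) {
        if (queue.offer(message)) return true;
        long n = dropped.incrementAndGet();
        // Stand-in for a metrics/alerting hook; never a silent drop.
        System.err.println("MAILBOX OVERFLOW: dropped " + n + " message(s)");
        return false;
    }

    M poll() { return queue.poll(); }
    long droppedCount() { return dropped.get(); }
}
```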
Debugging actor interactions: a stack trace that says akka://system/user/order-supervisor/order-ORD-123 ! FilledMessage is less useful than a stack trace that says line 147 in OrderStateMachine.processMessage(). With actors, the call stack at any point is usually just the actor dispatcher — the actual message send that triggered the problem might be many hops away in the message chain.
We added correlation IDs to all messages and ensured every log line included them. Post-mortem analysis still required reconstructing message flows manually from logs.
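That manual reconstruction amounts to filtering the interleaved log by correlation ID and ordering by timestamp. A minimal sketch with hypothetical envelope fields (our real messages carried more):

```java
import java.util.ArrayList;
import java.util.List;

// Every message carries the order's correlation ID and a timestamp,
// so one filter pass recovers a single order's message trail.
record Envelope(String correlationId, String messageType, long timestampMicros) {}

final class FlowReconstructor {
    // Pull one order's trail out of the full, interleaved message log.
    static List<Envelope> flowFor(List<Envelope> log, String correlationId) {
        List<Envelope> flow = new ArrayList<>();
        for (Envelope e : log) {
            if (e.correlationId().equals(correlationId)) flow.add(e);
        }
        flow.sort((a, b) -> Long.compare(a.timestampMicros(), b.timestampMicros()));
        return flow;
    }
}
```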
Ask pattern latency: Akka’s ask pattern (request-response between actors) creates a temporary actor per request and has non-trivial overhead (~10µs). Fine for workflow orchestration; not acceptable on the hot path. We had to be disciplined about which interactions used tell (fire-and-forget) vs. ask.
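The distinction can be sketched in plain Java: tell hands a message over and allocates nothing extra, while ask allocates a fresh reply handle per request (a CompletableFuture standing in for Akka's temporary reply actor). That per-request allocation is the overhead worth keeping off the hot path. Names here are illustrative, not Akka's API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

final class AskTellSketch {
    // tell: one-way; no reply plumbing is created.
    static void tell(Consumer<String> target, String msg) {
        target.accept(msg);
    }

    // ask: a new CompletableFuture per request plays the temporary-actor role;
    // the target completes it with the response.
    static CompletableFuture<String> ask(BiConsumer<String, Consumer<String>> target,
                                         String msg) {
        CompletableFuture<String> reply = new CompletableFuture<>();
        target.accept(msg, reply::complete);
        return reply;
    }
}
```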
The “let it crash” philosophy in a trading context: Akka encourages designing for failure — “let it crash” and restart. For most systems this is sensible. For a trading system where an actor restart means potentially re-sending an order to the venue (a duplicate trade), restarting under failure required much more careful thinking about idempotency and exactly-once semantics.
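One common idempotency guard for the duplicate-trade case is deduplication by client order ID at the venue gateway, so a re-send after a restart becomes a no-op. A minimal in-process sketch (hypothetical names; a real system would persist the seen-ID set so it also survives process restarts):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Refuses to transmit the same client order ID twice, making
// a retried send after an actor restart harmless.
final class VenueGateway {
    private final Set<String> sentOrderIds = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given clientOrderId is sent.
    boolean sendOnce(String clientOrderId) {
        if (!sentOrderIds.add(clientOrderId)) {
            return false; // duplicate: already sent, do not re-send to the venue
        }
        // ... actually transmit the order here ...
        return true;
    }
}
```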
Benchmark: Actors vs. State Machine
We benchmarked the actor-based order lifecycle against a simple single-threaded state machine with an input queue (our existing approach):
Approach                   Throughput   Latency p99   Memory
────────────────────────────────────────────────────────────
Actor per order             14,000/s       450µs      Higher
Single-threaded + queue     28,000/s       180µs      Lower
For our workload (a few hundred orders per minute at peak), neither was a bottleneck. But for a higher-volume system, the actor overhead would matter.
The overhead sources: actor creation/GC per order, dispatcher scheduling, mailbox allocation, message serialisation overhead (in cluster mode).
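For reference, the baseline approach can be sketched as a single consumer thread draining a bounded input queue: all order state is owned by that one thread, so it needs no locks, and events apply in arrival order. Hypothetical names, not our production engine:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Single-threaded engine: one input queue, one consumer thread,
// all state mutation confined to that thread.
final class SingleThreadedEngine implements Runnable {
    private final BlockingQueue<Runnable> inbox;
    private volatile boolean running = true;

    SingleThreadedEngine(int capacity) {
        this.inbox = new ArrayBlockingQueue<>(capacity);
    }

    // Producers enqueue events; returns false instead of blocking when full.
    boolean submit(Runnable event) { return inbox.offer(event); }

    // Stop the loop; events already queued still drain before exit.
    void stop() { running = false; }

    @Override public void run() {
        while (running || !inbox.isEmpty()) {
            Runnable event = inbox.poll();
            if (event != null) event.run(); // state mutation happens here, in order
        }
    }
}
```

Per order this allocates only the queued event, with no actor creation, dispatcher scheduling, or mailbox setup, which is where the throughput and latency gap in the table comes from.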
The Verdict
For the order workflow layer — where isolation, auditability, and supervision mattered more than raw throughput — actors were a good fit. We kept them.
For the price normalisation pipeline — where sub-millisecond latency was required and state was simple — we reverted to the Disruptor-based pipeline. Actors were too slow for that use case.
The lesson: the actor model’s benefits (isolation, message audit trail, supervision) are real, but so are its costs (overhead, debugging difficulty, mailbox management). Evaluate against the specific requirements of each component. “Use Akka for everything” is as wrong as “never use Akka.”