I joined the large financial institution expecting to find bureaucracy that slowed down engineering. I did find that. I also found something I didn’t expect: certain constraints imposed by regulation, scale, and risk aversion produced genuinely better engineering decisions than I’d been making at the smaller trading firm.
This is about the non-obvious lessons.
Auditability Is a First-Class Requirement
At the trading firm, we thought about auditability when auditors asked about it — which was rarely. At the bank, every trade, every risk calculation, every configuration change had to produce an immutable, timestamped, attributable record that could be reproduced and queried years later.
This constraint shaped system design in useful ways that have nothing to do with compliance.
When you design for auditability from the start, you naturally end up with:
- Event-sourced data models (append-only, the full history is the data)
- Structured, machine-readable logs with correlation IDs
- Explicit ownership of every component (someone’s name is attached to every change)
- Separation of command and query (the command that changed the state is stored separately from the current state)
These are good engineering practices regardless of regulation. The compliance requirement forced us to adopt them where we might have taken shortcuts.
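To make the first two bullets concrete, here is a minimal sketch in Python of an append-only, event-sourced model with attribution and correlation IDs. All names here (`Event`, `EventLog`, the `trade` payload shape) are illustrative inventions, not the bank's actual schema:

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """An immutable, attributable record: who did what, when, under which request."""
    event_type: str
    payload: dict
    actor: str                 # explicit ownership: a name attached to every change
    correlation_id: str        # ties the event to the request that caused it
    timestamp: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class EventLog:
    """Append-only log: the full history *is* the data; state is derived by replay."""

    def __init__(self):
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        # Never update or delete: the audit trail stays intact by construction.
        self._events.append(event)

    def replay(self) -> dict:
        """Derive current state (the query side) from the command history."""
        positions: dict[str, int] = {}
        for e in self._events:
            if e.event_type == "trade":
                sym = e.payload["symbol"]
                positions[sym] = positions.get(sym, 0) + e.payload["quantity"]
        return positions


log = EventLog()
cid = str(uuid.uuid4())
log.append(Event("trade", {"symbol": "EURUSD", "quantity": 100},
                 actor="alice", correlation_id=cid))
log.append(Event("trade", {"symbol": "EURUSD", "quantity": -40},
                 actor="bob", correlation_id=cid))
print(log.replay())  # {'EURUSD': 60}
```

Note how the command/query separation falls out naturally: the stored events are the commands, and `replay` is one of possibly many queries over them.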
The lesson: externally imposed constraints sometimes push you towards better designs than you’d choose freely. Before dismissing compliance requirements as bureaucracy, ask whether the constraint is actually pointing at something the system should have anyway.
Failure Modes Are Explicit Design Inputs
At a small firm, the question “what happens when X fails?” was answered at the time of the incident. At the bank, it was answered in the design document before any code was written.
Every system design required a section on failure modes: what fails, what the fallback behaviour is, how long the system can operate in degraded mode, and what conditions trigger a manual intervention.
This sounds like documentation overhead. In practice it changed what we built. Thinking through failure modes explicitly at design time:
- Revealed dependencies we hadn’t noticed (if service X is unavailable, we can’t do Y — is that acceptable?)
- Led to graceful degradation paths that wouldn’t have been built otherwise
- Exposed cases where the “simple” design was brittle in non-obvious ways
The discipline of writing down failure modes before building the system turned what would have been emergency responses into anticipated conditions with planned handling.
Scale Exposes Assumptions
The bank processed more trades in a day than the trading firm did in a year. This scale exposed assumptions baked into our designs that were never tested at smaller volumes.
The most common pattern: a design that works for 1,000 items and breaks for 10,000 — not because of algorithmic complexity, but because of assumptions that 1,000 items kept invisible.
Examples we encountered:
- A process that read all current positions into memory at startup (fine for 2,000 positions; OOM at 200,000)
- A reconciliation job that did N+1 queries against a database (fine at small N; unacceptably slow at N=50,000)
- A configuration file that listed all supported currency pairs (fine at 50; unwieldy and error-prone at 300)
The lesson: when you’re designing for small scale, future scale is theoretical. When you’re designing for large scale, the N+1 query and the O(n²) loop are immediate problems. Working at scale develops an instinct for these patterns that’s hard to acquire otherwise.
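The N+1 pattern from the reconciliation example can be shown in a few lines. This is a self-contained illustration against an in-memory SQLite table, not the bank's actual job; the table and account names are invented:

```python
import sqlite3

# Toy reconciliation data: 500 accounts with known quantities.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (account TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO positions VALUES (?, ?)",
                 [(f"acct{i}", i * 10) for i in range(500)])

accounts = [f"acct{i}" for i in range(500)]


def n_plus_one(conn, accounts):
    """One query per account: fine at small N, unacceptably slow at N=50,000."""
    out = {}
    for a in accounts:
        row = conn.execute(
            "SELECT quantity FROM positions WHERE account = ?", (a,)).fetchone()
        out[a] = row[0]
    return out


def batched(conn, accounts):
    """One query for all accounts: O(1) round-trips instead of O(N)."""
    placeholders = ",".join("?" for _ in accounts)
    rows = conn.execute(
        f"SELECT account, quantity FROM positions WHERE account IN ({placeholders})",
        accounts).fetchall()
    return dict(rows)


assert n_plus_one(conn, accounts) == batched(conn, accounts)
```

Both versions return the same answer, which is exactly why the N+1 shape survives code review at small N: nothing is wrong except an assumption about how many round-trips the database can absorb.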
Separation of Concerns Is Commercially Important
At the trading firm, one engineer often owned a component end-to-end — understood every line, made every change. This was efficient and fast.
At the bank, components were owned by teams that were responsible for them across multiple products. The order management system was used by FX trading, equity trading, rates trading, and structured products. A change to make FX better that broke equity was unacceptable regardless of FX’s justification.
This forced explicit interface design. The API between components had to be stable, versioned, and documented. Internal boundaries became real interfaces in the software engineering sense — not just “what works for this caller” but “what this component guarantees to all callers.”
The side effect: components with well-designed interfaces are easier to test, easier to replace, and easier to reason about. The discipline imposed by having multiple consumers produces better interfaces than single-consumer design.
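A minimal sketch of what such a multi-consumer contract might look like, assuming a Python codebase. `OrderGateway`, the schema-versioned request, and the desks that call it are all hypothetical stand-ins for whatever the real order management interface was:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderRequest:
    """Versioned request shape: adding optional fields is safe; removing is not."""
    schema_version: int
    instrument: str
    quantity: int


@dataclass(frozen=True)
class OrderAck:
    order_id: str
    accepted: bool


class OrderGateway(ABC):
    """What this component guarantees to ALL callers -- FX, equities, rates --
    rather than whatever happens to work for one desk."""

    SUPPORTED_VERSIONS = {1}

    @abstractmethod
    def submit(self, request: OrderRequest) -> OrderAck:
        """Must reject unsupported schema versions rather than guess."""


class InMemoryGateway(OrderGateway):
    def __init__(self):
        self._seq = 0

    def submit(self, request: OrderRequest) -> OrderAck:
        if request.schema_version not in self.SUPPORTED_VERSIONS:
            return OrderAck(order_id="", accepted=False)
        self._seq += 1
        return OrderAck(order_id=f"ORD-{self._seq}", accepted=True)


gw = InMemoryGateway()
ack = gw.submit(OrderRequest(schema_version=1, instrument="EURUSD", quantity=100))
print(ack)  # OrderAck(order_id='ORD-1', accepted=True)
```

Because the contract is explicit and versioned, a change made for one caller either fits the published interface or forces a new version — it cannot silently break another desk.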
What I Carried Forward
When I moved to the startup, I brought these habits:
- Design for failure modes explicitly, in writing, before building
- Treat auditability as a feature, not an afterthought
- Check assumptions about scale before they become production incidents
- Design interfaces for the calling patterns of multiple future callers, even if there’s only one caller today
Some of these slowed down initial development at the startup — writing down failure modes takes time. Most of them paid back quickly. The systems we designed with explicit failure handling were the ones that behaved predictably under load. The ones where we skipped that step were the ones that paged us at 2am.
Bureaucracy slows you down. Constraints sometimes make you better. Learning to tell the difference is most of the job.