Three months into the startup, the prototype was working and investors were asking for a production timeline. We had a Postgres database, a Python script doing the core business logic, and no infrastructure to speak of.
The decision: rewrite in Go, build proper infrastructure, or ship the Python and iterate? And if we rewrite, what does “proper infrastructure” mean when you have six engineers and four months of runway?
The Rewrite Decision
The Python script had one real problem and one saving grace: it was a script (no error handling, no retries, no observability), and its bottleneck was I/O rather than computation, which meant Python's GIL wasn't actually a constraint.
The case for Go: the team had Go experience, Go’s standard library HTTP and concurrency primitives would handle the I/O efficiently, and the compiled binary was easier to deploy than managing Python environments.
The case against rewriting: we were solving a known problem with a new language and a new codebase. Every hour spent rewriting was an hour not spent on product features.
We rewrote. The reasons that justified it: the Python code had no tests, no error handling, and was structured as a sequential script, not a service. The cost of making it production-worthy would have been comparable to the cost of rewriting it. The rewrite also gave us the chance to design for observability and reliability from the start rather than bolting them on.
The rewrite took three weeks. We shipped two weeks later than the Python path would have. That delta was worth it — the Go service had structured logging, request tracing, health checks, and proper error handling from day one.
The Technology Decisions
Database: kept PostgreSQL. We knew it, it was reliable, and the schema was simple enough that it didn’t need horizontal scaling yet. The heuristic: don’t solve a scaling problem you don’t have.
HTTP framework: net/http from the standard library. No third-party web framework. The routing needs were simple; the standard library handled them. This remained the right call — three years later, we’d never needed the framework’s extra features enough to justify the dependency.
Message queue: initially none. Operations were synchronous. Added Kafka six months later when we had an actual async workload that justified it.
Service discovery / config: environment variables from a .env file locally, injected by the deployment system in production. No Consul, no Vault, no service mesh. We added those tools only when the complexity of the deployment actually required them.
Observability: this we did invest in early. Structured JSON logging from day one (zerolog). Prometheus metrics from day one. Health endpoints (/health, /ready). The cost was two days of setup; the payoff was immediate — we could actually tell what was happening in production.
The Architecture Principles We Agreed On
Deploy without downtime: even with one service, zero-downtime deployment was worth having from the start. Rolling deploys via Kubernetes meant that a bad deploy didn’t take down the service, and rollbacks were fast. Setting this up early meant it was the default for everything we added later.
Fail fast on startup: if the service couldn’t reach the database at startup, it exited immediately with a clear error. No “start anyway and fail on first request.” This made configuration errors obvious immediately rather than discoverable only under traffic.
One thing at a time: the service did one thing. It accepted HTTP requests, processed them with business logic, and returned responses. No cron jobs, no message consumption, no side-channel admin API baked in. Other concerns were separate processes.
No clever code: with a small team moving fast, code that was clever was code that slowed down the next person to read it. We defaulted to the most straightforward implementation and only added complexity when profiling showed a clear need.
What We Got Wrong
Schema migration strategy: we used raw SQL migration files from the start, which was fine. But we didn't version them carefully, and after eight months we had a migration file that had been edited in place rather than superseded by a new migration. This bit us during a deployment when two engineers had different local schema states. The fix, treating migration files as immutable once applied, was obvious in retrospect.
Too few integration tests: unit tests were easy to write and gave fast feedback. Integration tests (real database, real service) were slower to set up. We skipped them early and paid for it when a change that passed unit tests caused a production regression that only manifested with a real database. Minimum viable integration test coverage would have caught it.
API versioning left too late: the API was versioned from day one (/v1/), but the actual breaking-change policy was undefined. Our first external integration revealed that we and our partner had different definitions of "breaking change." Defining the policy explicitly and communicating it before that first integration would have spared us the conversation.
The Pattern for Startup Service Decisions
The framework that reduced decision fatigue:
1. What’s the failure mode if we choose the simpler option? For most early decisions, the failure mode of the simpler option was “this becomes painful at 10× current scale.” We weren’t at 10× current scale. Choose simple.
2. Is this reversible? Language and database choices are hard to reverse. Logging format and metric naming are easy to reverse. Invest more thought in the hard-to-reverse decisions.
3. What does the team actually know how to operate? A technology choice that introduces a new operational model (new database paradigm, new deployment system) has a hidden tax in on-call experience. Choose boring technology when the alternative requires new on-call knowledge.
4. Are we solving a present problem or a future one? “We might need horizontal scaling” is a future problem. “We can’t deploy without downtime” is a present problem. Solve the present problem.
The first production service is a stake in the ground. The decisions made under the pressure of an initial launch tend to persist far longer than you expect, because changing them requires a dedicated migration effort. That’s not a reason to over-engineer the first service — it’s a reason to think carefully about the handful of decisions that are genuinely hard to change later, and choose quickly on everything else.