Three months after deploying Kafka with Avro schemas and the Confluent Schema Registry, we had a production incident where a schema change caused a downstream consumer to silently produce incorrect output — wrong field values, no error thrown, no monitoring alert triggered.
That incident rewired how the team thought about schema evolution. The tools don’t protect you from all the failure modes. Understanding the rules and building organisational processes around them is what does.
Avro’s Compatibility Rules
Avro defines compatibility in terms of what the reader can do with data written by a different schema version.
Backward compatibility: new schema can read data written by old schema. Forward compatibility: old schema can read data written by new schema. Full compatibility: both of the above.
| Schema change | Backward? | Forward? | Full? |
|---|---|---|---|
| Add field with default | ✓ | ✓ | ✓ |
| Add field without default | ✗ | ✓ | ✗ |
| Remove field with default | ✓ | ✓ | ✓ |
| Remove field without default | ✓ | ✗ | ✗ |
| Change field type | mostly ✗ | mostly ✗ | ✗ |
| Rename field | ✗ | ✗ | ✗ |

Removing a field is always backward-compatible, because an Avro reader ignores writer fields it doesn't declare; whether the removal is forward-compatible depends on the old reader having a default to fall back on. Type changes are incompatible apart from the few promotions Avro permits (e.g. int → long, float → double). Renames are incompatible in both directions unless aliases are used (see below).
The Schema Registry enforces your chosen compatibility level when you register a new schema version. If you configure BACKWARD compatibility, it blocks registration of a schema that violates backward compatibility rules.
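The registry also exposes this check directly, so a candidate schema can be validated before anything is deployed. A sketch, assuming a registry at localhost:8081 and a hypothetical subject name trades-value:

```shell
# Ask the registry whether a candidate schema is compatible with the
# latest registered version of the subject (subject name is illustrative)
curl -s -X POST \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\":\"record\",\"name\":\"Trade\",\"fields\":[{\"name\":\"tradeDate\",\"type\":\"string\"}]}"}' \
  http://localhost:8081/compatibility/subjects/trades-value/versions/latest
# response: {"is_compatible": true} or {"is_compatible": false}
```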
The Incident
The producer service updated its Avro schema to rename a field: tradeDate → executionDate. Both names refer to the same concept — the date the trade was executed.
The Schema Registry was configured for FORWARD compatibility (old readers must be able to read data from new writers). Renaming a field is not forward-compatible: the old reader knows nothing about executionDate, fails to find tradeDate in the incoming data, and falls back to tradeDate's default (null). The Schema Registry should have blocked the registration.
It didn’t, because the team had registered the schema with NONE compatibility mode months earlier for a one-off migration and never changed it back.
The consumer read tradeDate → got null → used null as a date → downstream P&L calculations used a null date → formatted as 1970-01-01 in the report.
The output was wrong but looked plausible. It only surfaced two weeks later during a month-end reconciliation.
The Practices That Would Have Prevented It
1. Never use NONE compatibility mode
NONE is for initial development only. The moment production data exists, set the subject to FULL_TRANSITIVE, or at minimum FULL.
FULL_TRANSITIVE checks compatibility against every previous version, not just the immediately preceding one. This prevents the “compatible with v3 but not v1” situation where old consumers that have been offline are broken when they restart.
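Setting the level is a one-line call against the registry's per-subject config endpoint. A sketch, assuming a registry at localhost:8081 and a hypothetical subject name trades-value:

```shell
# Pin the subject to the strictest mode: every new version must be
# compatible with every previously registered version
curl -X PUT \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility": "FULL_TRANSITIVE"}' \
  http://localhost:8081/config/trades-value
```

Omitting the subject (`PUT /config`) sets the registry-wide default instead.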
2. Aliases instead of renames
Avro supports field aliases — the reader schema can accept the old field name:
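For example, the renamed field in the new .avsc can carry the old name as an alias (the type and default here are illustrative, not taken from the real schema):

```json
{
  "name": "executionDate",
  "type": ["null", "string"],
  "default": null,
  "aliases": ["tradeDate"]
}
```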
Aliases are applied from the reader's side: a reader using the new schema matches the alias against the old field name in the data, so the rename is backward-compatible. They do not, by themselves, help an old reader faced with the new name; forward compatibility still depends on the old field having a default, or on migrating consumers before producers.
3. Two-phase migration for breaking changes
When you genuinely need a breaking change, the safe approach is:
Phase 1: Add new field alongside old field
- Schema v2 adds executionDate (with default) alongside tradeDate
- All producers write both fields
- All consumers updated to read executionDate (fall back to tradeDate)
- Deploy, run for N days
Phase 2: Remove old field
- Schema v3 removes tradeDate
- Verify no consumer reads tradeDate
- Register v3, deploy producers
- Eventually: consumers stop reading tradeDate fallback
This extends the migration timeline from hours to days or weeks, and for a window every record redundantly carries both fields. In exchange, no consumer breaks mid-migration: old and new readers keep working throughout, and the state of the rollout stays observable.
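The phase-1 consumer fallback is only a few lines. A minimal sketch, assuming records are handled as plain dicts (the helper name is invented):

```python
def execution_date(record: dict):
    """Prefer the new field; fall back to the old one while both exist.

    Checks for None explicitly so an empty-but-present new value does not
    accidentally mask the fallback logic.
    """
    if record.get("executionDate") is not None:
        return record["executionDate"]
    return record.get("tradeDate")
```

Once phase 2 completes and no tradeDate values remain in retained data, the fallback branch can be deleted.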
4. Schema review as part of code review
.avsc files live in version control. Schema changes go through the same review process as code. Reviewers check:
- Is this backward/forward/full compatible?
- Are there consumers that need updating?
- Does this require a two-phase migration?
- Are all new fields given defaults?
5. Consumer test suites that run against old schema versions
The most reliable defence: consumer services have test cases that deserialise data written by older schema versions and assert correctness. If a producer schema change breaks an old reader, the consumer’s tests fail — before deployment.
Types of Schema Change and Their Cost
| Change type | Strategy | Cost |
|---|---|---|
| Add optional field | Add with default, single deploy | Low |
| Add required field | Two-phase: add optional → make required | Medium |
| Rename field | Use aliases, two-phase | Medium |
| Remove field | Two-phase: deprecate → verify unused → remove | Medium |
| Change field type | Almost always requires new topic + migration | High |
| Reorder fields | Don’t — Avro uses name-based matching, order is irrelevant | None |
| Split field into multiple | New fields + migration | High |
The Regulated Environment Constraint
One aspect specific to financial institutions: data retention requirements meant we couldn’t just migrate all historical data. Regulatory records in Kafka topics had 7-year retention. Any schema change had to be readable by software that would be running (or rebuilt) 7 years later.
This pushed us strongly towards FULL_TRANSITIVE compatibility: every schema change had to be readable by all previous consumers. The two-phase migration process wasn’t just an engineering best practice — it was a compliance requirement.
The tooling that helped: a custom linter that ran against every .avsc change in CI, checking compatibility against all registered versions for that subject. Schema changes that failed the compatibility check blocked the PR. Combined with mandatory schema review, this reduced the incident rate from “occasionally catastrophic” to “very rare.”
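The core of that linter is small. A sketch with registry access abstracted behind callables (function names are invented), so that the policy it enforces, checking against every registered version rather than just the latest, is the visible part; in CI, is_compatible would call the registry's compatibility endpoint:

```python
from typing import Callable, List

def incompatible_versions(
    subject: str,
    candidate_schema: str,
    list_versions: Callable[[str], List[int]],
    is_compatible: Callable[[str, str, int], bool],
) -> List[int]:
    # FULL_TRANSITIVE-style policy: the candidate must be compatible with
    # every registered version of the subject, not just the latest one
    return [
        v for v in list_versions(subject)
        if not is_compatible(subject, candidate_schema, v)
    ]

def lint(subject: str, candidate_schema: str, list_versions, is_compatible) -> bool:
    # CI gate: any incompatibility blocks the PR
    bad = incompatible_versions(subject, candidate_schema, list_versions, is_compatible)
    for v in bad:
        print(f"{subject}: candidate incompatible with registered version {v}")
    return not bad
```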