Three months after deploying Kafka with Avro schemas and the Confluent Schema Registry, we had a production incident where a schema change caused a downstream consumer to silently produce incorrect output — wrong field values, no error thrown, no monitoring alert triggered.

That incident rewired how the team thought about schema evolution. The tools don’t protect you from all the failure modes. Understanding the rules and building organisational processes around them is what does.

Avro’s Compatibility Rules

Avro defines compatibility in terms of what the reader can do with data written by a different schema version.

Backward compatibility: new schema can read data written by old schema. Forward compatibility: old schema can read data written by new schema. Full compatibility: both of the above.

Schema change                 Backward?  Forward?  Full?
────────────────────────────────────────────────────────
Add field with default        ✓          ✓         ✓
Add field without default     ✗          ✓         ✗
Remove field with default     ✓          ✓         ✓
Remove field without default  ✓          ✗         ✗
Change field type             mostly ✗   mostly ✗  ✗
Rename field (no alias)       ✗          ✗         ✗

Removing a field is always backward compatible (the new reader just skips the unknown field in old data); forward compatibility is what depends on the removed field having a default in the reader's schema. The "mostly" covers Avro's promotion rules: int can be read as long, float, or double; long as float or double; float as double; string and bytes are interchangeable. Every other type change is incompatible.

The Schema Registry enforces your chosen compatibility level when you register a new schema version. If you configure BACKWARD compatibility, it blocks registration of a schema that violates backward compatibility rules.
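You can run the same check yourself before attempting registration, using the registry's compatibility endpoint. A minimal sketch with Java's built-in HttpClient; the registry URL and subject name are placeholders:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RegistryPreflight {

    // Builds a POST to the Schema Registry's compatibility endpoint, which tests
    // a candidate schema against the latest registered version WITHOUT registering
    // it. The response body contains {"is_compatible": true|false}.
    static HttpRequest compatibilityCheckRequest(String registryUrl, String subject,
                                                 String schemaJson) {
        // The registry expects the schema as a JSON string inside a JSON object,
        // so quotes and backslashes in the schema itself must be escaped.
        String escaped = schemaJson.replace("\\", "\\\\").replace("\"", "\\\"");
        String body = "{\"schema\": \"" + escaped + "\"}";
        return HttpRequest.newBuilder(
                URI.create(registryUrl + "/compatibility/subjects/" + subject
                        + "/versions/latest"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }
}
```

Send the request with `HttpClient.newHttpClient().send(...)`; a curl equivalent is the same body POSTed to the same path. This is the check worth wiring into CI so an incompatible schema fails the build rather than the registration.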

The Incident

The producer service updated its Avro schema to rename a field: tradeDate → executionDate. Both names refer to the same concept — the date the trade was executed.

The Schema Registry was configured for FORWARD compatibility (old readers must be able to read data from new writers). A rename breaks that contract: the old reader knows nothing about executionDate, and its tradeDate field, now absent from the wire, falls back to its default (null). The Schema Registry should have blocked the registration.

It didn’t, because the team had registered the schema with NONE compatibility mode months earlier for a one-off migration and never changed it back.

The consumer read tradeDate → got null → passed the null date into downstream P&L calculations → the null was coerced to epoch zero and rendered as 1970-01-01 in the report.

The output was wrong but looked plausible. It only surfaced two weeks later during a month-end reconciliation.

The Practices That Would Have Prevented It

1. Never use NONE compatibility mode

NONE is for initial development. The moment data is in production, set the compatibility level to FULL at minimum, and preferably FULL_TRANSITIVE.

FULL_TRANSITIVE checks compatibility against every previous version, not just the immediately preceding one. This prevents the “compatible with v3 but not v1” situation where old consumers that have been offline are broken when they restart.

# Set compatibility for a subject:
curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility": "FULL_TRANSITIVE"}' \
  https://registry/config/trades-value

2. Aliases instead of renames

Avro supports field aliases — the reader schema can accept the old field name:

{
  "name": "executionDate",
  "type": ["null", "string"],
  "default": null,
  "aliases": ["tradeDate"]
}

With the alias in place, a reader using the new schema can resolve data written with the old field name, which makes the rename backward compatible. Forward compatibility (an old reader on data written with the new name) still depends on the old schema's tradeDate field having a default, because aliases are applied from the reader's schema.
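Alias resolution can be demonstrated end-to-end with Avro's generic API. A sketch assuming Avro on the classpath, with the schemas cut down to the one field that matters:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AliasDemo {

    static final String V1 = "{\"type\":\"record\",\"name\":\"Trade\",\"fields\":["
        + "{\"name\":\"tradeDate\",\"type\":\"string\"}]}";
    static final String V2 = "{\"type\":\"record\",\"name\":\"Trade\",\"fields\":["
        + "{\"name\":\"executionDate\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null,\"aliases\":[\"tradeDate\"]}]}";

    // Encode a record with the old (v1) schema, then decode it with the new (v2)
    // reader schema; the alias maps the writer's tradeDate onto executionDate.
    static String demo() throws Exception {
        Schema v1 = new Schema.Parser().parse(V1);
        Schema v2 = new Schema.Parser().parse(V2);

        GenericRecord rec = new GenericData.Record(v1);
        rec.put("tradeDate", "2018-10-04");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(v1).write(rec, enc);
        enc.flush();

        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord resolved = new GenericDatumReader<GenericRecord>(v1, v2).read(null, dec);
        return String.valueOf(resolved.get("executionDate")); // the v1 value, under the new name
    }
}
```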

3. Two-phase migration for breaking changes

When you genuinely need a breaking change, the safe approach is:

Phase 1: Add new field alongside old field
  - Schema v2 adds executionDate (with default) alongside tradeDate
  - All producers write both fields
  - All consumers updated to read executionDate (fall back to tradeDate)
  - Deploy, run for N days

Phase 2: Remove old field
  - Schema v3 removes tradeDate
  - Verify no consumer reads tradeDate
  - Register v3, deploy producers
  - Eventually: consumers stop reading tradeDate fallback

This extends the migration timeline from hours to days or weeks, and it means there's a window where every record carries both fields. That duplication costs a little payload size, but it keeps every reader, old or new, working throughout the transition.
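During phase 1, the consumer-side fallback can be as small as a helper like this. A sketch over a plain map; in real code it would sit behind the generated record class:

```java
import java.util.Map;

public class TradeFields {

    // Prefer the new field; fall back to the old one for records written by
    // producers that haven't been upgraded yet. Works for v1, v2, and v3 payloads.
    static String executionDate(Map<String, Object> record) {
        Object value = record.get("executionDate");
        if (value == null) {
            value = record.get("tradeDate"); // old writers only set this
        }
        return value == null ? null : value.toString();
    }
}
```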

4. Schema review as part of code review

.avsc files live in version control. Schema changes go through the same review process as code. Reviewers check:

  • Is this backward/forward/full compatible?
  • Are there consumers that need updating?
  • Does this require a two-phase migration?
  • Are all new fields given defaults?

5. Consumer test suites that run against old schema versions

The most reliable defence: consumer services have test cases that deserialise data written by older schema versions and assert correctness. If a producer schema change breaks an old reader, the consumer’s tests fail — before deployment.

@Test
void readV1TradeWithV2Schema() {
    // v1 trade written before executionDate field existed:
    byte[] v1Encoded = encodeWithSchema(v1Schema, Map.of(
        "symbol", "EUR/USD",
        "tradeDate", "2018-10-04",
        "quantity", 10_000_000L
    ));

    // Read with v2 schema (which has executionDate with tradeDate alias):
    Trade v2Trade = decodeWithSchema(v2Schema, v1Encoded);

    assertThat(v2Trade.getExecutionDate()).isEqualTo("2018-10-04");
    assertThat(v2Trade.getQuantity()).isEqualTo(10_000_000L);
}

Types of Schema Change and Their Cost

Change type                Strategy                                                  Cost
─────────────────────────────────────────────────────────────────────────────────────────
Add optional field         Add with default, single deploy                           Low
Add required field         Two-phase: add optional → make required                   Medium
Rename field               Use aliases, two-phase                                    Medium
Remove field               Two-phase: deprecate → verify unused → remove             Medium
Change field type          Almost always requires new topic + migration              High
Reorder fields             Don’t — Avro uses name-based matching, order is irrelevant  None
Split field into multiple  New fields + migration                                    High

The Regulated Environment Constraint

One aspect specific to financial institutions: data retention requirements meant we couldn’t just migrate all historical data. Regulatory records in Kafka topics had 7-year retention. Any schema change had to be readable by software that would be running (or rebuilt) 7 years later.

This pushed us strongly towards FULL_TRANSITIVE compatibility: every schema change had to be readable by all previous consumers. The two-phase migration process wasn’t just an engineering best practice — it was a compliance requirement.

The tooling that helped: a custom linter that ran against every .avsc change in CI, checking compatibility against all registered versions for that subject. Schema changes that failed the compatibility check blocked the PR. Combined with mandatory schema review, this reduced the incident rate from “occasionally catastrophic” to “very rare.”
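The core of such a linter can be built on Avro's own compatibility checker. A sketch assuming Avro on the classpath; fetching the registered versions from the registry is left out:

```java
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaLint {

    // FULL_TRANSITIVE: the candidate must be able to read data written by every
    // previous version (backward) AND be readable by every previous reader (forward).
    static boolean fullTransitiveCompatible(Schema candidate, List<Schema> previousVersions) {
        for (Schema old : previousVersions) {
            boolean backward = SchemaCompatibility
                .checkReaderWriterCompatibility(candidate, old)
                .getType() == SchemaCompatibilityType.COMPATIBLE;
            boolean forward = SchemaCompatibility
                .checkReaderWriterCompatibility(old, candidate)
                .getType() == SchemaCompatibilityType.COMPATIBLE;
            if (!backward || !forward) {
                return false;
            }
        }
        return true;
    }

    // Small demo: adding an optional field passes; dropping a required field fails.
    static boolean demo() {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Trade\",\"fields\":["
            + "{\"name\":\"tradeDate\",\"type\":\"string\"}]}");
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Trade\",\"fields\":["
            + "{\"name\":\"tradeDate\",\"type\":\"string\"},"
            + "{\"name\":\"executionDate\",\"type\":[\"null\",\"string\"],\"default\":null}]}");
        Schema v3 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Trade\",\"fields\":["
            + "{\"name\":\"executionDate\",\"type\":[\"null\",\"string\"],\"default\":null}]}");
        return fullTransitiveCompatible(v2, List.of(v1))
            && !fullTransitiveCompatible(v3, List.of(v1, v2));
    }
}
```

Note this catches mechanical incompatibility only; a rename where the old field had a default (the incident above) sails through, which is why the human schema review remains mandatory.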