Apache Avro: The Schema That Reads Your Mind
Discover how Avro eliminates field tags entirely, enabling dynamic schema generation for data pipelines. Learn why writer and reader can use different schemas—and how this powers the Hadoop ecosystem.
Why This Matters
Protocol Buffers gets our data down to 33 bytes. Impressive. But Avro looked at those numbered field tags and asked: "What if we just... didn't?" The result? 32 bytes—and a radically different approach to schema evolution that powers the entire Hadoop ecosystem.
Real-world relevance: Avro is the backbone of data pipelines at LinkedIn, Netflix, and countless data engineering teams. When you're dumping terabytes from a relational database into a data lake, Avro's ability to generate schemas dynamically—without manually assigning tag numbers—is a game-changer.
Learning Objectives
- [ ] Understand how Avro achieves the smallest encoding by eliminating tag numbers entirely
- [ ] Compare the writer's schema vs reader's schema approach to Protobuf's single-schema model
- [ ] Apply Avro's schema evolution rules (default values, union types) to real data pipelines
- [ ] Evaluate when to use Avro vs Protobuf based on use case (Hadoop, dynamic schemas, RPC)
The Big Picture: No Tag Numbers?!
Remember how Protobuf replaced field names with tag numbers? Avro goes further: no field identifiers at all. The encoded data is just values concatenated together.
Diagram Explanation: Avro removes even the tag numbers. The encoding is purely values in schema-defined order. This saves one more byte—and enables a completely different approach to schema evolution.
But wait—if there are no field identifiers, how does the decoder know which value is which?
Answer: Both the encoder and decoder must have access to the schema. The decoder reads values in the order the schema defines them.
Avro Schema: Two Flavors
Avro offers two ways to write schemas:
Avro IDL (Human-Friendly)
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
JSON Schema (Machine-Friendly)
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
Notice what's missing: no `1:`, `2:`, `3:` tag numbers! Fields are identified by name, not number.
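If fastavro is available (an assumption; any Avro library works similarly), you can sanity-check the JSON flavor by parsing it. A minimal sketch:
```python
import fastavro

# The JSON-flavor schema from above, as a Python dict
person_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

# parse_schema raises if the schema is structurally invalid
parsed = fastavro.parse_schema(person_schema)
print(parsed["name"])  # Person
```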
The Encoding: Pure Values
Here's what our example record looks like in Avro's 32 bytes:
AVRO ENCODING (32 bytes)

┌──────────┬──────────┬──────────┬──────────────┐
│ len: 6   │ "Martin" │ union: 1 │ varint: 1337 │
│ (1 byte) │ (6 bytes)│ (1 byte) │ (2 bytes)    │
└──────────┴──────────┴──────────┴──────────────┘
┌──────────┬────────────────────────────────────────────────┐
│ count: 2 │ "daydreaming" (11 bytes) + "hacking" (7 bytes) │
│ (1 byte) │ (array items, each with a length prefix)       │
└──────────┴────────────────────────────────────────────────┘

No field names! No tag numbers! Just values in schema order.
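To see this layout yourself, here is a small sketch (assuming fastavro is installed) that encodes the example record and dumps the raw bytes; the first byte, 0x0c, is the zigzag-encoded string length 6, followed immediately by the ASCII bytes of "Martin":
```python
import fastavro
from io import BytesIO

schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
})

buf = BytesIO()
fastavro.schemaless_writer(buf, schema, {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
})
raw = buf.getvalue()

print(len(raw))      # 32
print(raw.hex(" "))  # 0c 4d 61 72 74 69 6e ... (no tags anywhere, just values)
```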
The risk: If the reader uses a different schema than the writer, values get misinterpreted. "Martin" might be read as an integer!
The solution: Avro's killer feature—schema resolution.
Writer's Schema vs Reader's Schema
This is where Avro gets clever. Unlike Protobuf (where each side simply decodes with whatever schema it has, relying on tag numbers to line fields up), Avro explicitly uses both schemas and reconciles them:
Diagram Explanation: The writer encodes with their schema (v1). The reader has a different schema (v2). Avro's library uses BOTH schemas to resolve differences: matching fields by name, filling in defaults for new fields, ignoring removed fields.
How Schema Resolution Works
| Scenario | What Happens |
|----------|--------------|
| Field in writer & reader | Data transferred normally |
| Field in writer only | Ignored by reader |
| Field in reader only | Filled with default value |
| Field order different | No problem—matched by name |
| Field type changed | Converted if compatible |
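As a hedged sketch of the "field in writer only" row, the writer below includes a nickname field that the reader's schema doesn't know about, so it is silently skipped (the schema and field names here are illustrative):
```python
import fastavro
from io import BytesIO

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "nickname", "type": "string"},  # writer-only field
    ],
})
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},  # reader never heard of "nickname"
    ],
})

buf = BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"userName": "Martin", "nickname": "M"})
buf.seek(0)

decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
print(decoded)  # {'userName': 'Martin'} -- the writer-only field was ignored
```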
Schema Evolution Rules
Avro's rules are stricter but more predictable than Protobuf's:
The Golden Rule: Every field you add or remove must have a default value.
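To make the rule concrete, here's a sketch of what breaking it looks like: the reader adds an email field with no default, so data written without that field cannot be resolved (the exact exception type depends on the library, so it's caught generically here):
```python
import fastavro
from io import BytesIO

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})

# Violates the golden rule: new field, no default value
bad_reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "email", "type": "string"},  # no default!
    ],
})

buf = BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"userName": "Martin"})
buf.seek(0)

try:
    fastavro.schemaless_reader(buf, writer_schema, bad_reader_schema)
except Exception as exc:
    # Old data has no value for "email" and there's no default to fall back on
    print(f"Schema resolution failed: {exc}")
```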
Union Types: Avro's Null Handling
Unlike Protobuf's optional keyword, Avro uses union types for nullable fields:
// This field can be null OR a long
union { null, long } favoriteNumber = null;
This is more verbose but more explicit:
Diagram Explanation: Protobuf makes everything implicitly optional. Avro requires you to explicitly declare that a field can be null using a union type. This prevents null-related bugs by making nullability visible in the schema.
Important: You can only use null as a default value if null is one of the union branches!
// ✅ Valid - null is in the union
union { null, long } age = null;
// ❌ Invalid - null is NOT in the union
long age = null; // Compile error!
Where Does the Schema Live?
If reader and writer need different schemas, how does the reader get the writer's schema?
Diagram Explanation: Avro adapts to different contexts. Files embed the schema once. Databases store schema versions separately. Network connections negotiate at handshake time.
Context 1: Large Files (Hadoop)
Avro's Object Container File format:
┌───────────────────────────────────────┐
│ File Header                           │
│ ├── Magic bytes ("Obj" + 0x01)        │
│ ├── Writer's schema (JSON)            │
│ └── Sync marker                       │
├───────────────────────────────────────┤
│ Block 1: [record, record, record...]  │
├───────────────────────────────────────┤
│ Block 2: [record, record, record...]  │
├───────────────────────────────────────┤
│ ... millions more records ...         │
└───────────────────────────────────────┘
The schema is written once at the beginning. All records use the same schema. This is perfect for data lakes and batch processing.
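A minimal sketch of this with fastavro (the filename is illustrative); note that reading back needs no schema argument, because the writer's schema is embedded in the file header:
```python
import fastavro

schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

records = [
    {"userName": "Martin", "favoriteNumber": 1337},
    {"userName": "Ada", "favoriteNumber": None},
]

# Write an Object Container File: header (magic, schema, sync marker) + record blocks
with open("people.avro", "wb") as out:
    fastavro.writer(out, schema, records)

# Read it back -- the schema comes from the file itself
with open("people.avro", "rb") as fo:
    for record in fastavro.reader(fo):
        print(record)
```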
Context 2: Database Records
Each record stores a version number:
┌─────────┬─────────────────────────┐
│ v: 42 │ encoded record bytes... │
└─────────┴─────────────────────────┘
A schema registry maps version numbers to schemas:
Version 42 → { "name": "Person", "fields": [...] }
Version 43 → { "name": "Person", "fields": [..., email] }
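Here's a hedged sketch of that pattern with a toy in-memory registry (real systems use a service such as Confluent Schema Registry and typically a multi-byte version prefix; the single-byte prefix and the names below are illustrative):
```python
import struct
from io import BytesIO

import fastavro

# Toy registry: schema version -> writer's schema
REGISTRY = {
    42: fastavro.parse_schema({
        "type": "record", "name": "Person",
        "fields": [{"name": "userName", "type": "string"}],
    }),
}

def encode_record(version, record):
    """Prefix the Avro-encoded bytes with a 1-byte schema version."""
    buf = BytesIO()
    buf.write(struct.pack("B", version))
    fastavro.schemaless_writer(buf, REGISTRY[version], record)
    return buf.getvalue()

def decode_record(data, reader_schema):
    """Look up the writer's schema by version, then resolve against the reader's schema."""
    version = data[0]
    return fastavro.schemaless_reader(BytesIO(data[1:]), REGISTRY[version], reader_schema)

blob = encode_record(42, {"userName": "Martin"})
print(decode_record(blob, REGISTRY[42]))  # {'userName': 'Martin'}
```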
Context 3: Network (Avro RPC)
Client and server negotiate schema versions at connection time, then use the agreed schema for all messages in that session.
The Killer Feature: Dynamic Schema Generation
Here's where Avro truly shines over Protobuf:
Scenario: You want to export a relational database to a binary format for analysis.
With Protobuf 😰
- Write a `.proto` file with field tags
- If the database schema changes, manually update the mapping
- Carefully avoid reusing old tag numbers
- Regenerate code, redeploy
With Avro 😎
- Auto-generate Avro schema from database metadata
- Database schema changes? Just regenerate the Avro schema
- Field names match automatically—no manual tag management
- Old readers can still read new data (via schema resolution)
"""
Dynamically generate Avro schema from database table
"""
def generate_avro_schema(table_name, columns):
"""
Auto-generate Avro schema from database metadata.
No manual tag numbers needed!
"""
fields = []
for col_name, col_type in columns:
avro_type = {
'VARCHAR': 'string',
'INTEGER': 'int',
'BIGINT': 'long',
'BOOLEAN': 'boolean',
'TIMESTAMP': {'type': 'long', 'logicalType': 'timestamp-millis'}
}.get(col_type, 'string')
fields.append({
'name': col_name,
'type': ['null', avro_type], # Make nullable
'default': None
})
return {
'type': 'record',
'name': table_name,
'fields': fields
}
# Example: Generate schema from users table
columns = [
    ('id', 'BIGINT'),
    ('name', 'VARCHAR'),
    ('email', 'VARCHAR'),
    ('created_at', 'TIMESTAMP')
]
schema = generate_avro_schema('users', columns)
print(schema)
# Output:
# {
# 'type': 'record',
# 'name': 'users',
# 'fields': [
# {'name': 'id', 'type': ['null', 'long'], 'default': None},
# {'name': 'name', 'type': ['null', 'string'], 'default': None},
# {'name': 'email', 'type': ['null', 'string'], 'default': None},
# {'name': 'created_at', 'type': ['null', {...}], 'default': None}
# ]
# }
This is why Avro dominates in data engineering: schemas can be generated on-the-fly from any metadata source.
Real-World Analogy
Think of Avro like a restaurant order system with separate forms for waiters and kitchen:
| Avro Concept | Restaurant Analogy |
|--------------|--------------------|
| Writer's schema | Waiter's order pad (v1: appetizer, main, dessert) |
| Reader's schema | Kitchen's prep sheet (v2: appetizer, main, dessert, allergies) |
| Schema resolution | Manager translates between forms |
| Default values | "If no drink specified, assume water" |
| No tag numbers | Orders use dish names, not numbers |
The waiter writes with their form. The kitchen reads with their form. The manager (Avro library) handles the translation—matching dish names, filling in defaults for new fields.
Practical Example: Using Avro in Python
"""
This example demonstrates Apache Avro encoding from DDIA Chapter 4.
Key insight: Writer and reader can use different schema versions!
"""
import fastavro
from io import BytesIO
# Writer's schema (v1)
writer_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}}
    ]
}

# Reader's schema (v2 - added email field with default)
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
        {"name": "email", "type": ["null", "string"], "default": None}  # NEW!
    ]
}

# Data to encode
person = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

# Encode with writer's schema
buffer = BytesIO()
fastavro.schemaless_writer(buffer, fastavro.parse_schema(writer_schema), person)
encoded_bytes = buffer.getvalue()
print(f"Avro encoded size: {len(encoded_bytes)} bytes")  # 32 bytes

# Decode with reader's schema (different version!)
buffer = BytesIO(encoded_bytes)
decoded = fastavro.schemaless_reader(
    buffer,
    fastavro.parse_schema(writer_schema),  # Writer's schema
    fastavro.parse_schema(reader_schema)   # Reader's schema
)
print(f"Decoded: {decoded}")
# Output: {'userName': 'Martin', 'favoriteNumber': 1337,
#          'interests': ['daydreaming', 'hacking'], 'email': None}
# Note: email field filled with default value!
# Compare sizes
import json
json_bytes = json.dumps(person, separators=(',', ':')).encode('utf-8')  # compact JSON, no extra whitespace
print(f"\nJSON: {len(json_bytes)} bytes")
print(f"Avro: {len(encoded_bytes)} bytes")
print(f"Savings: {(1 - len(encoded_bytes) / len(json_bytes)) * 100:.0f}%")
# Output:
# JSON: 81 bytes
# Avro: 32 bytes
# Savings: 60%
The Merits of Schemas: A Summary
After exploring JSON, MessagePack, Protobuf, and Avro, here's why schema-based binary encoding wins:
| Benefit | Explanation |
|---------|-------------|
| Compact | 60% smaller than JSON by omitting field names |
| Self-Documenting | Schema IS the documentation—always up-to-date |
| Compatibility Checking | Schema registry validates changes before deployment |
| Type Safety | Generated code provides compile-time checks |
| Evolution | Add/remove fields without breaking existing code |
Key Takeaways
- No tag numbers = dynamic schemas: Avro matches fields by name, enabling auto-generation from database metadata. No manual tag assignment needed.
- Writer ≠ Reader: Avro explicitly supports different schema versions. The library resolves differences at read time using both schemas.
- Default values are mandatory: Every field you add or remove must have a default. This ensures old/new code can always read each other's data.
- Union types for nulls: Instead of `optional`, Avro uses `union { null, T }` to explicitly declare nullable fields. More verbose, fewer null-related bugs.
- Schema travels with data: In files, it's embedded in the header. In databases, a version number references a schema registry. In RPC, it's negotiated at connection time.
Common Pitfalls
| ❌ Misconception | ✅ Reality |
|------------------|-----------|
| "Avro is just like Protobuf" | Avro has no tag numbers—fundamentally different evolution model based on field names |
| "I can add any field I want" | New fields MUST have default values, or you break backward compatibility |
| "null is always allowed" | Only if your type is a union that includes null: union { null, T } |
| "Avro is slower because of schema resolution" | Resolution happens once per read batch, not per record. Negligible overhead. |
| "I need to regenerate code when schema changes" | Avro works without code generation—great for dynamic languages like Python |
| "Changing field order breaks things" | Schema resolution matches by name, not position. Order changes are safe. |
Interview Prep
Q: How does Avro achieve smaller encoding than Protocol Buffers?
A: Avro eliminates field tag numbers entirely. The encoded data is just values in schema-defined order. The decoder uses the schema to know what each value means. This saves the 1-2 bytes per field that Protobuf uses for tags.
Q: What's the difference between writer's schema and reader's schema in Avro?
A: The writer encodes data with their schema version. The reader decodes with their (potentially different) schema version. Avro's library resolves differences: matching fields by name, filling defaults for missing fields, ignoring unknown fields.
Q: Why is Avro popular in Hadoop/data engineering but Protobuf dominates RPC?
A: Avro excels at dynamic schemas—you can generate schemas from database metadata without manual tag assignment. Perfect for data pipelines. Protobuf's tag numbers require manual management but are simpler for stable APIs with code generation.
Q: What are Avro's schema evolution rules?
A: (1) Every field you add/remove must have a default value. (2) Use union types for nullable fields. (3) Field renames require aliases for backward compatibility. (4) Type changes must be compatible (e.g., int→long is safe, long→int may truncate).
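For the rename rule, the Avro spec lets the reader's schema declare the old name as an alias. A minimal sketch (support for field aliases during resolution varies by library, so treat this as illustrative):
```python
# Writer's schema (v1): the original field name
old_schema = {
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
}

# Reader's schema (v2): field renamed, old name kept as an alias
new_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "displayName", "type": "string", "aliases": ["userName"]},
    ],
}
```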
What's Next?
We've now covered the full spectrum of data encoding formats—from JSON's human readability to Avro's schema-driven efficiency. But encoding is only half the story.
The next sections explore how encoded data flows through systems:
- REST and RPC: Synchronous request-response between services
- Message Passing: Asynchronous communication via queues
- Actors: Distributed state and messaging
We'll see how encoding choices (JSON for REST, Protobuf for gRPC, Avro for Kafka) shape entire system architectures.