Nauman Munir

Thrift and Protocol Buffers: How Google and Facebook Shrunk Data by 60%

Discover how Google and Facebook eliminated field names from their encoding formats, achieving 60% smaller payloads than JSON while enabling safe schema evolution for billions of requests per day.

13 min read
#Protocol Buffers#Thrift#Binary Encoding#Schema Evolution#gRPC#System Design#Data Serialization


Why This Matters

In the last section, we saw MessagePack shrink our JSON from 81 bytes to 66. Impressive? Google and Facebook looked at that and said, "We can do better." Their secret weapon: stop encoding field names entirely. The result? The same data in just 33 bytes—a 60% reduction from JSON.

Real-world relevance: Protocol Buffers powers virtually all of Google's internal services. Thrift runs Facebook's backend. When you're processing billions of requests per day, saving 50 bytes per message translates to petabytes of bandwidth and storage savings. These aren't academic exercises—they're battle-tested in the world's largest distributed systems.


Learning Objectives

  • [ ] Understand how field tags replace field names to achieve dramatic size reduction
  • [ ] Compare Thrift's BinaryProtocol, CompactProtocol, and Protocol Buffers encoding strategies
  • [ ] Apply schema evolution rules to safely add/remove fields without breaking compatibility
  • [ ] Evaluate the trade-offs between Thrift's list type and Protobuf's repeated marker

The Big Picture: Schema-Based Binary Encoding

The insight that powers both Thrift and Protocol Buffers is simple but profound:

Why encode field names when both sender and receiver already know them?

If you have a schema that defines the structure, you don't need to repeat "userName" in every single message. Instead, you assign each field a tag number and just encode that.


Diagram Explanation: Both sender and receiver have the schema. The sender encodes field tag numbers (1, 2, 3) instead of names. The receiver looks up what each number means. This eliminates the biggest source of bloat in JSON.
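To make the saving concrete, here's a back-of-envelope comparison in Python. The `0x0a` header byte follows Protobuf's actual tag/wire-type packing; the byte counts are for this single field only:

```python
# One field, two encodings.

# JSON repeats the field name inside every single record:
json_field = b'"userName":"Martin"'
print(len(json_field))  # 19 bytes

# A schema-based format replaces the name with a 1-byte header:
# 0x0a = (field tag 1 << 3) | wire type 2 (length-delimited),
# followed by a 1-byte length and the raw string bytes.
binary_field = bytes([0x0a, 0x06]) + b"Martin"
print(len(binary_field))  # 8 bytes
```

The field name costs you in every record; the tag is a constant one byte no matter how descriptive the name in the schema is.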


The Schema: Interface Definition Language (IDL)

Both Thrift and Protocol Buffers use a schema language to define data structures. Here's how you'd define our example record:

Thrift IDL

struct Person {
    1: required string userName,
    2: optional i64 favoriteNumber,
    3: optional list<string> interests
}

Protocol Buffers IDL

message Person {
    required string user_name       = 1;
    optional int64 favorite_number  = 2;
    repeated string interests       = 3;
}

Notice the key elements:

  • Field tags (1, 2, 3): These are the compact identifiers that replace field names
  • Types: string, i64/int64, list/repeated
  • Required/Optional: Validation hints (more on this later)

From these schemas, code generators produce classes in Java, Python, Go, C++, etc. Your application uses these generated classes—type-safe, IDE-friendly, and blazing fast.


Three Encoding Formats Compared

Thrift offers two binary formats, while Protocol Buffers has one. Let's see how they encode the same data:

{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

Diagram Explanation: Each step shows progressive optimization. BinaryProtocol removes field names. CompactProtocol adds variable-length integers. Protocol Buffers squeezes out one more byte with slightly different bit-packing.


Deep Dive: How Each Format Works

Thrift BinaryProtocol (59 bytes)

The straightforward approach:

  • Each field has a type indicator (1 byte)
  • Each field has its tag number (2 bytes)
  • String lengths use 4 bytes (fixed)
  • Integers use 8 bytes (fixed)
┌──────────────────────────────────────────────┐
│ Field 1: userName                            │
├──────────┬───────────┬───────────┬───────────┤
│ Type: 11 │ Tag: 1    │ Length: 6 │ "Martin"  │
│ (string) │ (2 bytes) │ (4 bytes) │ (6 bytes) │
├──────────┴───────────┴───────────┴───────────┤
│ Field 2: favoriteNumber                      │
├──────────┬───────────┬───────────────────────┤
│ Type: 10 │ Tag: 2    │ 1337 (8-byte i64)     │
│ (i64)    │ (2 bytes) │ (fixed width)         │
└──────────┴───────────┴───────────────────────┘

The problem: Fixed-width integers waste bytes. The number 1337 could fit in 2 bytes, but we're using 8.
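You can see the waste directly with Python's `struct` module, mimicking a fixed-width i64:

```python
import struct

# BinaryProtocol-style fixed-width int64: always 8 bytes, big-endian
fixed = struct.pack(">q", 1337)
print(fixed.hex())  # 0000000000000539 -> six leading zero bytes wasted
print(len(fixed))   # 8

# The value itself needs only 11 bits, i.e. it fits in 2 bytes
print((1337).bit_length())  # 11
```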

Thrift CompactProtocol (34 bytes)

The clever optimizations:

  • Pack type + tag into a single byte (when tag < 16)
  • Variable-length integers: Small numbers = fewer bytes

Diagram Explanation: Variable-length encoding uses the high bit of each byte to signal "more bytes coming." Numbers -64 to 63 fit in 1 byte. Numbers -8192 to 8191 fit in 2 bytes. This saves massive space for common small values.
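A minimal sketch of this scheme in Python. `zigzag` and `varint` are my own helper names, not Thrift API, but the bit layout matches the description above: ZigZag interleaves negatives so small magnitudes stay small, then the varint emits 7 payload bits per byte with the high bit as a continuation flag:

```python
def zigzag(n: int) -> int:
    # Map signed to unsigned: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4
    return (n << 1) ^ (n >> 63)

def varint(u: int) -> bytes:
    # 7 payload bits per byte; high bit set means "more bytes follow"
    out = bytearray()
    while u > 0x7F:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)
    return bytes(out)

print(len(varint(zigzag(63))))    # 1 byte  (range -64..63)
print(len(varint(zigzag(-64))))   # 1 byte
print(len(varint(zigzag(1337))))  # 2 bytes (range -8192..8191)
```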

Protocol Buffers (33 bytes)

Very similar to CompactProtocol, with slight bit-packing differences. Google's format wins by 1 byte in this example—but the difference is negligible. Both achieve ~60% reduction from JSON.


Field Tags: The Heart of the System

Field tags are the most important concept in schema-based encoding. They're what enable both compactness AND evolution.


Diagram Explanation: The encoded data only contains tag numbers, not field names. This means you can rename userName to fullName in your schema without breaking anything—the wire format still says "tag 1". But if you change tag 1 to mean something else, all existing data becomes garbage.


Schema Evolution: Adding and Removing Fields

Real systems evolve. You'll add new fields, deprecate old ones. Here's how to do it safely:

Adding a New Field

// Version 1
message Person {
    required string user_name       = 1;
    optional int64 favorite_number  = 2;
}
 
// Version 2 (added email)
message Person {
    required string user_name       = 1;
    optional int64 favorite_number  = 2;
    optional string email           = 4;  // NEW! Tag 4 (never use 3 again if it existed)
}

Diagram Explanation: When old code reads data with tag 4, it simply ignores it (forward compatibility). When new code reads old data without tag 4, the field is just missing/null (backward compatibility). Both directions work!

The Critical Rules

| Action | Rule | Why |
|--------|------|-----|
| Add field | Must be optional or have default | Old data won't have it |
| Remove field | Only remove optional fields | Old data might have it |
| Change tag | NEVER | Breaks all existing data |
| Reuse tag | NEVER | Old data with that tag becomes corrupted |
| Rename field | ✅ Safe | Wire format uses tags, not names |


Datatype Evolution: Here Be Dragons

Changing a field's type is risky. Some changes work, others corrupt data:


Diagram Explanation: Widening conversions (32→64 bit) are safe—new code reads old data and fills zeros. Narrowing conversions (64→32 bit) cause truncation—if the 64-bit value doesn't fit in 32 bits, data is silently corrupted.
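A Python sketch of the difference, using `struct` to reinterpret raw bytes (illustrative only, not actual Protobuf decoding):

```python
import struct

big = 5_000_000_000  # needs more than 32 bits

# Widening (32 -> 64 bits) is lossless: pad with zero bytes, value survives
widened = struct.unpack("<q", struct.pack("<i", 1337) + b"\x00" * 4)[0]
print(widened)  # 1337

# Narrowing (64 -> 32 bits) keeps only the low 4 bytes: silent corruption
narrowed = struct.unpack("<i", struct.pack("<q", big)[:4])[0]
print(narrowed)  # 705032704 -- not 5000000000, and no error was raised
```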

The Protobuf repeated Trick

Protocol Buffers has no list type. Instead, you mark fields as repeated:

repeated string interests = 3;

In the wire format, this just means tag 3 appears multiple times. This enables a neat evolution:

// Version 1: Single value
optional string main_interest = 3;
 
// Version 2: Multiple values
repeated string interests = 3;
  • New code reading old data: Sees a list with 0 or 1 element ✅
  • Old code reading new data: Sees only the last element ⚠️ (data loss, but no crash)

Thrift's list<string> is more explicit but doesn't allow this evolution.
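The "last element wins" behavior can be seen with a toy decoder over the wire bytes for two occurrences of tag 3. This is a deliberate simplification: real Protobuf parsing handles more wire types and multi-byte headers, but the structure is the same:

```python
def decode_strings(data: bytes) -> list[tuple[int, str]]:
    """Parse (tag, string) pairs; assumes 1-byte headers and lengths."""
    fields, i = [], 0
    while i < len(data):
        tag = data[i] >> 3      # header byte: (tag << 3) | wire type
        length = data[i + 1]
        fields.append((tag, data[i + 2 : i + 2 + length].decode()))
        i += 2 + length
    return fields

# Tag 3 appears twice on the wire: 0x1a = (3 << 3) | wire type 2
wire = bytes([0x1a, 0x0b]) + b"daydreaming" + bytes([0x1a, 0x07]) + b"hacking"
fields = decode_strings(wire)

# New code (repeated): collects every occurrence of tag 3
print([v for t, v in fields if t == 3])  # ['daydreaming', 'hacking']

# Old code (single optional field): each occurrence overwrites the last
last = {t: v for t, v in fields}
print(last[3])  # 'hacking' -- the first value is silently dropped
```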


Real-World Analogy

Think of field tags like postal codes versus full addresses.

| Approach | Example | Size |
|----------|---------|------|
| JSON (full address) | "123 Main Street, Springfield, IL" | Long |
| Protobuf (postal code) | "62701" | Short |

Both sender and receiver have a shared "address book" (the schema). The sender writes "62701" and the receiver looks it up to find the full address. If Springfield's postal code ever changes, chaos ensues—just like changing a tag number.


Practical Example: Using Protocol Buffers in Python

"""
This example demonstrates Protocol Buffers encoding from DDIA Chapter 4.
Key insight: Schemas + field tags = 60% smaller than JSON
"""
 
# First, define your schema in person.proto:
# 
# syntax = "proto3";
# message Person {
#     string user_name = 1;
#     int64 favorite_number = 2;
#     repeated string interests = 3;
# }
#
# Then run: protoc --python_out=. person.proto
 
# This generates person_pb2.py - import the generated class
# from person_pb2 import Person
 
# For demonstration, let's simulate what happens:
import json
 
# The data we want to encode
person_data = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}
 
# JSON encoding (what we've been doing)
# Compact separators (no spaces) match the 81-byte figure used above
json_bytes = json.dumps(person_data, separators=(',', ':')).encode('utf-8')
print(f"JSON size: {len(json_bytes)} bytes")  # 81 bytes
 
# Protocol Buffers encoding (simulated - actual protobuf would be):
# person = Person()
# person.user_name = "Martin"
# person.favorite_number = 1337
# person.interests.extend(["daydreaming", "hacking"])
# proto_bytes = person.SerializeToString()
# print(f"Protobuf size: {len(proto_bytes)} bytes")  # 33 bytes
 
# The actual protobuf bytes would look something like:
proto_bytes_simulated = bytes([
    0x0a, 0x06,  # Field 1, wire type 2 (length-delimited), length 6
    0x4d, 0x61, 0x72, 0x74, 0x69, 0x6e,  # "Martin"
    0x10, 0xb9, 0x0a,  # Field 2, wire type 0 (varint), value 1337
    0x1a, 0x0b,  # Field 3, wire type 2, length 11
    0x64, 0x61, 0x79, 0x64, 0x72, 0x65, 0x61, 0x6d, 0x69, 0x6e, 0x67,  # "daydreaming"
    0x1a, 0x07,  # Field 3 again (repeated), length 7
    0x68, 0x61, 0x63, 0x6b, 0x69, 0x6e, 0x67,  # "hacking"
])
 
print(f"Protobuf size: {len(proto_bytes_simulated)} bytes")  # 33 bytes
print(f"Savings: {(1 - len(proto_bytes_simulated) / len(json_bytes)) * 100:.0f}%")
 
# Output:
# JSON size: 81 bytes
# Protobuf size: 33 bytes
# Savings: 59%

Schema Evolution in Action

"""
Demonstrating safe schema evolution with Protocol Buffers
"""
 
# Version 1 schema:
# message Person {
#     string user_name = 1;
#     int64 favorite_number = 2;
# }
 
# Version 2 schema (added email):
# message Person {
#     string user_name = 1;
#     int64 favorite_number = 2;
#     string email = 3;  # NEW FIELD
# }
 
# Old code writes data (no email field)
old_data = {
    "user_name": "Martin",
    "favorite_number": 1337
    # No email field
}
 
# New code reads it - email will just be empty/default
# person = Person()
# person.ParseFromString(old_data_bytes)
# print(person.email)  # Empty string (default)
# print(person.user_name)  # "Martin" - still works!
 
# New code writes data (with email)
new_data = {
    "user_name": "Martin", 
    "favorite_number": 1337,
    "email": "martin@example.com"
}
 
# Old code reads it - simply ignores tag 3 (email)
# It doesn't crash, it just skips bytes it doesn't understand
print("✅ Forward compatible: Old code ignores unknown field 3")
print("✅ Backward compatible: New code handles missing field 3")

Key Takeaways

  1. Field tags are everything: They replace field names in the wire format, saving ~50% of payload size. Tags are immutable—never change or reuse them.

  2. 60% smaller than JSON: By eliminating field names and using variable-length integers, Protobuf/Thrift achieve 33 bytes vs JSON's 81 bytes for the same data.

  3. Schema evolution is built-in: Add optional fields with new tag numbers. Old code ignores unknown tags (forward compat). New code handles missing fields (backward compat).

  4. Required vs Optional is validation, not encoding: Both encode identically. required just triggers a runtime check—and makes future removal impossible. Prefer optional.

  5. Code generation = type safety: Unlike JSON parsing, generated Protobuf/Thrift classes give you compile-time type checking, IDE autocompletion, and documentation.


Common Pitfalls

| ❌ Misconception | ✅ Reality |
|------------------|-----------|
| "I'll change this tag number, NBD" | Changing tag numbers breaks ALL existing data. It's like renaming postal codes. |
| "I'll reuse tag 3 for a different field" | Old data with tag 3 will be misinterpreted. Never reuse—skip to the next number. |
| "I'll make the new field required" | New code reading old data fails validation: the old data lacks the field. |
| "required gives me better encoding" | Required/optional encode identically. It's purely a validation hint. |
| "I can safely change int32 to int64" | Going bigger is safe. Going smaller (64→32) truncates data silently! |
| "Protobuf is Google-only technology" | It's open source, supported in 10+ languages, and used everywhere from Kubernetes to gRPC. |


Interview Prep

Q: How do Thrift/Protobuf achieve smaller sizes than JSON?
A: Two key optimizations: (1) Field tags instead of field names—a 1-2 byte number vs a multi-character string. (2) Variable-length integers—small numbers use fewer bytes. Together, these typically achieve 50-60% size reduction.

Q: What are the rules for schema evolution in Protocol Buffers?
A: (1) Never change or reuse tag numbers. (2) New fields must be optional or have defaults (for backward compat). (3) Only remove optional fields (for forward compat). (4) Field names can change freely—only tags matter in the wire format.

Q: What's the difference between Thrift and Protocol Buffers?
A: Very similar! Key differences: (1) Thrift has two binary formats (BinaryProtocol, CompactProtocol); Protobuf has one. (2) Thrift has explicit list<T>; Protobuf uses repeated marker. (3) Protobuf allows optional→repeated evolution; Thrift doesn't.

Q: Why is required considered harmful?
A: Once you mark a field required, you can never remove it—old data will fail validation. And new fields can't be required because old data won't have them. The Protobuf team now recommends avoiding required entirely; proto3 removed it.


What's Next?

Thrift and Protocol Buffers connect reader and writer through tag numbers compiled into both sides: a reader can skip fields it doesn't recognize, but it never learns which schema the writer actually used. What if reader and writer have genuinely different schema versions, and you need to reconcile them dynamically?

The next section introduces Apache Avro, which takes a radically different approach: schemas are embedded with the data, and the reader can use a different schema from the writer. This enables even more flexible schema evolution, particularly for data warehousing and Hadoop ecosystems.