Nauman Munir

Modes of Dataflow: Sending Messages to Your Future Self


10 min read
#Dataflow · #Databases · #Schema Evolution · #Distributed Systems · #System Design · #Data Engineering · #Backward Compatibility


Why This Matters

You've learned how to encode data efficiently—JSON, Protobuf, Avro. But encoding is only half the story. The real question is: how does that encoded data flow between processes? And here's the twist: when you write to a database, you're essentially sending a message to your future self. Will future-you understand what past-you wrote?

Real-world relevance: At companies like LinkedIn, Uber, and Netflix, multiple versions of code run simultaneously during rolling deployments. Your database must handle writes from new code and reads from old code—and vice versa. Understanding dataflow modes is essential for designing systems that evolve gracefully.


Learning Objectives

  • [ ] Understand the three fundamental modes of dataflow: databases, services, and message passing
  • [ ] Compare compatibility requirements across different dataflow modes
  • [ ] Apply strategies for preserving unknown fields during schema evolution
  • [ ] Evaluate when to use archival formats like Avro container files vs Parquet

The Big Picture: Three Ways Data Flows

Whenever you need to send data to another process—whether across the network or to disk—you encode it as bytes. But who encodes, who decodes, and when?

[Diagram: the three modes of dataflow (databases, services, message passing)]

Diagram Explanation: Data flows through systems in three primary modes. Databases involve time-separated communication (past writes, future reads). Services are synchronous request-response. Message passing is asynchronous with an intermediary broker.

This section focuses on Dataflow Through Databases—the most subtle and often overlooked pattern.


Dataflow Through Databases: A Message to Your Future Self

When you write to a database, think of it as sending a message to whoever reads that data next. Often, that "someone" is a future version of your own application.

[Diagram: data written with today's schema is read later by future application code]

Diagram Explanation: You write data today with your current schema. Months or years later, your newer application code must still understand that old data. This is why backward compatibility (new code reads old data) is essential for databases.

The Twist: It's Not Just Backward Compatibility

In production, multiple versions run simultaneously during rolling deployments:

[Diagram: old (v1.0) and new (v2.0) instances reading and writing the same database during a rolling deployment]

Diagram Explanation: During rolling deployments, old and new code instances run simultaneously, all accessing the same database. This means v1.0 code might read data written by v2.0 code. Databases need both backward AND forward compatibility.

| Scenario | Compatibility Needed |
|----------|----------------------|
| New code reads old data | Backward compatibility ✅ |
| Old code reads new data | Forward compatibility ✅ |
| During rolling deployment | Both required! |
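
To make the two directions concrete, here is a minimal sketch in plain Python (the field names are illustrative, not from any real schema) of how each code version might read the same table: the v2.0 reader supplies a default for old rows that predate the new field, while the v1.0 reader simply ignores fields it has never heard of.

"""
Sketch: backward and forward compatibility on the read path.
Field names (name, age, email) are illustrative only.
"""

def read_user_v2(record: dict) -> dict:
    """v2.0 reader (backward compatibility): old rows were written
    before 'email' existed, so fall back to a default."""
    return {
        'name': record['name'],
        'age': record['age'],
        'email': record.get('email'),  # None for rows written by v1.0
    }

def read_user_v1(record: dict) -> dict:
    """v1.0 reader (forward compatibility): new rows may carry fields
    this version has never heard of; read only what it knows.
    (But see the unknown field problem below before writing back!)"""
    return {
        'name': record['name'],
        'age': record['age'],
    }

old_row = {'name': 'Alice', 'age': 25}                            # written by v1.0
new_row = {'name': 'Bob', 'age': 31, 'email': 'bob@example.com'}  # written by v2.0

print(read_user_v2(old_row))   # {'name': 'Alice', 'age': 25, 'email': None}
print(read_user_v1(new_row))   # {'name': 'Bob', 'age': 31}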


The Unknown Field Problem

Here's where things get tricky. Say you add an email field in v2.0:

[Diagram: v1.0 code reads a v2.0 record, updates the fields it knows, and writes it back without the email field]

Diagram Explanation: When v1.0 code reads a record with unknown fields (email), updates only the fields it knows about (age), and writes back—the unknown field is silently dropped. This is the "unknown field problem" that can cause subtle data loss during rolling deployments.

The Solution: Preserve What You Don't Understand

The encoding formats we studied (Protobuf, Avro) support preserving unknown fields at the encoding level, but your application code must be careful:

"""
This example demonstrates the unknown field preservation problem from DDIA Chapter 4.
Key insight: Decoding into model objects can lose unknown fields!
"""
 
# ❌ DANGEROUS: Loses unknown fields
class UserModelV1:
    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age
 
def update_user_dangerous(db_record: dict) -> dict:
    """
    Decodes into model object, then re-encodes.
    Any fields not in UserModelV1 are LOST!
    """
    user = UserModelV1(
        name=db_record['name'],
        age=db_record['age']
    )
    # ... update age ...
    user.age = 30
    
    # Re-encode: email field is gone!
    return {'name': user.name, 'age': user.age}
 
 
# ✅ SAFE: Preserves unknown fields
def update_user_safe(db_record: dict) -> dict:
    """
    Works with raw dict, preserving all fields.
    Updates only what we need, leaves rest intact.
    """
    # Keep the original record (including unknown fields)
    updated_record = db_record.copy()
    
    # Update only the fields we care about
    updated_record['age'] = 30
    
    # Unknown fields (like 'email') are preserved!
    return updated_record
 
 
# Example usage
original_record = {
    'name': 'Alice',
    'age': 25,
    'email': 'alice@example.com'  # v2.0 field, unknown to v1.0
}
 
print("Dangerous approach:", update_user_dangerous(original_record))
# Output: {'name': 'Alice', 'age': 30}  ← email LOST!
 
print("Safe approach:", update_user_safe(original_record))
# Output: {'name': 'Alice', 'age': 30, 'email': 'alice@example.com'}  ← email preserved!

Data Outlives Code: The Five-Year Problem

Here's a profound observation: data outlives code.

When you deploy a new application version, the old code is replaced within minutes. But the data in your database? That five-year-old record is still there, encoded with the schema from 2021.

How Databases Handle This

Most relational databases avoid rewriting data by using clever tricks:

| Database | Strategy |
|----------|----------|
| PostgreSQL, MySQL | Add new columns with a NULL default; old rows read as NULL |
| LinkedIn Espresso | Uses Avro; schema evolution handles version differences |
| MongoDB | Schema-less; each document self-describes |
| Data warehouses | Periodic ETL rewrites data in the latest schema |

The database appears to use a single schema, even though underlying storage contains records from many schema versions.
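
As a rough sketch of that read path (plain Python, with hypothetical column names and defaults), this is essentially what the database does when a column was added after some rows were written: missing columns surface as NULL instead of forcing a rewrite of old rows.

"""
Sketch: presenting one logical schema over rows stored under older schemas.
Column names and the 'email' addition are illustrative only.
"""

# Current logical schema: column name -> default for rows that predate it
CURRENT_SCHEMA_DEFAULTS = {
    'name': None,
    'age': None,
    'email': None,  # column added later; older rows never stored it
}

def read_row(stored_row: dict) -> dict:
    """Fill in NULL (None) for columns the stored row predates,
    without rewriting the stored bytes."""
    return {column: stored_row.get(column, default)
            for column, default in CURRENT_SCHEMA_DEFAULTS.items()}

row_from_2021 = {'name': 'Alice', 'age': 25}  # written before 'email' existed
print(read_row(row_from_2021))
# {'name': 'Alice', 'age': 25, 'email': None}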


Archival Storage: The Fresh Start

Sometimes you do want to rewrite everything with a consistent schema. This happens during:

  • Backups to data warehouses
  • Data lake exports
  • Analytics pipelines

[Diagram: records in mixed schema versions are normalized to the latest schema in the archive snapshot]

Diagram Explanation: When exporting to archives, you normalize all data to the latest schema. The archive becomes a consistent snapshot—ideal for analytics tools that expect uniform schemas.

Best Formats for Archival

| Format | Best For | Why |
|--------|----------|-----|
| Avro container files | General archival | Schema in the file header, compact, splittable |
| Parquet | Analytics | Column-oriented, excellent compression, query pushdown |
| ORC | Hive/Hadoop | Similar to Parquet, Hive-optimized |
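
As a sketch of the export step, assuming the third-party fastavro and pyarrow packages are available (the file names, schema, and records below are made up for illustration), a normalized snapshot might be written like this:

"""
Sketch: writing an archival snapshot, normalized to the latest schema.
Assumes `pip install fastavro pyarrow`; names and data are illustrative.
"""
from fastavro import parse_schema, writer
import pyarrow as pa
import pyarrow.parquet as pq

# All records normalized to the latest schema before archiving
records = [
    {'name': 'Alice', 'age': 30, 'email': 'alice@example.com'},
    {'name': 'Bob', 'age': 31, 'email': None},  # old row: email filled as null
]

# Avro container file: the writer's schema is embedded in the file header
avro_schema = parse_schema({
    'type': 'record',
    'name': 'User',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
        {'name': 'email', 'type': ['null', 'string'], 'default': None},
    ],
})
with open('users_snapshot.avro', 'wb') as out:
    writer(out, avro_schema, records)

# Parquet: column-oriented, well suited to analytics queries
pq.write_table(pa.Table.from_pylist(records), 'users_snapshot.parquet')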


Real-World Analogy

Think of database dataflow like a time capsule:

| Concept | Time Capsule Analogy |
|---------|----------------------|
| Writing to DB | Burying a time capsule with today's newspapers |
| Reading from DB | Opening a capsule from 50 years ago |
| Backward compatibility | Can you read old newspapers? (Usually yes) |
| Forward compatibility | Could 1970s you read a 2025 newspaper? (Mostly yes, some confusion) |
| Unknown field problem | Someone in 1970 repacks the capsule but leaves out "weird" items they don't recognize |
| Data outlives code | The capsule outlasts the people who buried it |

Future archaeologists (your future code) must understand artifacts (data) from civilizations (code versions) that no longer exist.


The Three Modes: A Preview

This section focused on databases, but there are two more dataflow modes coming:

| Mode | Encoding | Decoding | Timing |
|------|----------|----------|--------|
| Databases | Writer process | Reader process | Separated by time (seconds to years) |
| Services (REST/RPC) | Client | Server (and vice versa) | Synchronous request-response |
| Message passing | Producer | Consumer | Asynchronous, via broker |

Each mode has different compatibility requirements and patterns. Services and message passing are covered in the next sections.


Key Takeaways

  1. Database writes are messages to your future self: When you write data today, some future version of your code will need to read it. Design for that conversation.

  2. Databases need BOTH backward AND forward compatibility: During rolling deployments, old code reads new data (forward) AND new code reads old data (backward).

  3. Beware the unknown field problem: When old code updates a record with new fields, those fields can be silently lost if you decode into model objects. Preserve what you don't understand.

  4. Data outlives code: Your code is replaced in minutes; your data persists for years. Schema evolution isn't optional—it's how you bridge the gap.

  5. Archival is your chance to normalize: When exporting to data warehouses or analytics systems, encode everything with the latest schema. Use Avro containers or Parquet for best results.


Common Pitfalls

| ❌ Misconception | ✅ Reality |
|------------------|-----------|
| "Backward compatibility is enough for databases" | You also need forward compatibility for rolling deployments—old code must handle new data |
| "Our encoding format preserves unknown fields, so we're safe" | Your application code can still lose unknown fields when decoding to model objects |
| "All data in my database uses the current schema" | Data persists for years; you have records in every schema version ever deployed |
| "I can just run a migration to update all records" | Migrations on large datasets are expensive; most DBs avoid rewriting data |
| "Schema evolution is a nice-to-have" | It's mandatory—your 5-year-old data must be readable by today's code |
| "Archival and production use the same format" | Archives often use column-oriented formats (Parquet) optimized for analytics, not row-oriented formats |


What's Next?

Databases are just one mode of dataflow. The next sections explore two more patterns:

  • Dataflow Through Services (REST and RPC): Synchronous communication between clients and servers. How do you evolve APIs without breaking clients?

  • Message-Passing Dataflow: Asynchronous communication via message brokers like Kafka. How do producers and consumers evolve independently?

Each mode has its own compatibility challenges and solutions. We'll see how the encoding formats we studied (JSON, Protobuf, Avro) apply to each.