Skip to main content
Nauman Munir

B-Trees

Deep dive into B-Trees - the most successful data structure in database history, powering indexes in PostgreSQL, MySQL, Oracle, and virtually every relational database for over 50 years.

19 min read
#B-Trees#Database Internals#Storage Engines#Indexes#Data Structures#PostgreSQL#MySQL
Loading audio player...

B-Trees

The Most Successful Data Structure in Database History

While LSM-Trees are gaining popularity, they're the newcomers. The real veteran of database indexing is the B-Tree, introduced in 1970 and described as "ubiquitous" by 1979. More than 50 years later, B-Trees remain the default index implementation in almost every relational database, PostgreSQL, MySQL, Oracle, SQL Server, and many NoSQL databases too.

Like SSTables, B-Trees keep data sorted by key, enabling both fast lookups and efficient range queries. But the underlying philosophy is completely different. Where LSM-Trees write sequentially to variable-size segments, B-Trees read and write fixed-size pages (typically 4 KB), updating them in place. This design maps directly to how disk hardware works, both hard drives and SSDs operate on fixed-size blocks.

Loading diagram...

Understanding Pages and Pointers

Think of a B-Tree as a tree structure, but instead of nodes in memory, we have pages on disk. Each page is identified by an address (like a disk location), and pages can contain references (pointers) to other pages. It's like pointers in a programming language, but pointing to disk locations instead of memory addresses.

The tree has one special page called the root. Every lookup starts at the root and follows page references down through the tree until it reaches the data.

Loading diagram...

How B-Tree Lookups Work

Let's walk through finding a specific key. Say we're looking for key 251 in the tree.

  1. Start at the root page: The root contains keys that act as boundaries (like signposts). We see boundaries at 100, 200, 300. Since 251 is between 200 and 300, we follow the pointer for that range.

  2. Navigate internal pages: The next page breaks down the 200-300 range further. We see boundaries at 220, 250, 280. Since 251 is between 250 and 280, we follow that pointer.

  3. Reach a leaf page: Finally we arrive at a leaf page containing individual keys and their values. We scan this small page to find key 251.

Loading diagram...

The number of child pointers a page can hold is called the branching factor. In the diagrams above, it's small for readability. In real databases, a branching factor of 500 or more is common. Why does this matter? Because higher branching factors mean shallower trees.


The Magic of Logarithmic Depth

Here's why B-Trees are so efficient: a B-Tree with n keys has a depth of only O(log n). With a branching factor of 500, let's calculate:

  • Level 1 (root): 1 page, up to 500 children
  • Level 2: 500 pages, up to 250,000 children
  • Level 3: 250,000 pages, up to 125 million children
  • Level 4: 125 million pages

A four-level B-Tree with 4 KB pages and a branching factor of 500 can store up to 256 terabytes of data. And to find any key, you only need to read 4 pages, that's just 4 disk reads!

Loading diagram...
def btree_depth(num_keys: int, branching_factor: int) -> int:
    """Calculate the minimum depth needed to store num_keys."""
    import math
    if num_keys <= 0:
        return 0
    # At depth d, we can store up to branching_factor^d keys
    return math.ceil(math.log(num_keys) / math.log(branching_factor))
 
 
# Examples
print(f"1 million keys, branching 500: {btree_depth(1_000_000, 500)} levels")
print(f"1 billion keys, branching 500: {btree_depth(1_000_000_000, 500)} levels")
print(f"1 trillion keys, branching 500: {btree_depth(1_000_000_000_000, 500)} levels")
 
# Output:
# 1 million keys, branching 500: 3 levels
# 1 billion keys, branching 500: 4 levels
# 1 trillion keys, branching 500: 5 levels

Updates and Insertions

Updating an existing key is straightforward:

  1. Search for the leaf page containing the key
  2. Modify the value in that page
  3. Write the page back to disk

The page stays in the same location, so all pointers to it remain valid.

Inserting a new key requires finding the right leaf page and adding the key there. But what if the page is full?

Loading diagram...

When a page overflows, we split it into two half-full pages and update the parent page to point to both. If the parent also overflows, it splits too, and this can cascade up to the root. If the root splits, we create a new root, this is the only way a B-Tree grows taller.


B-Tree Implementation

Let's build a simple B-Tree in Python to understand the mechanics:

from typing import Optional, List, Tuple, Any
 
class BTreeNode:
    """A node (page) in the B-Tree."""
 
    def __init__(self, leaf: bool = False):
        self.leaf = leaf
        self.keys: List[Any] = []
        self.values: List[Any] = []  # Only used in leaf nodes
        self.children: List['BTreeNode'] = []  # Only used in internal nodes
 
    def __repr__(self):
        return f"BTreeNode(keys={self.keys}, leaf={self.leaf})"
 
 
class BTree:
    """
    A B-Tree implementation demonstrating the core concepts.
 
    In a real database, nodes would be pages on disk, and we'd use
    page IDs instead of object references.
    """
 
    def __init__(self, order: int = 4):
        """
        order: maximum number of children per node (branching factor).
        A node can have at most (order - 1) keys.
        """
        self.order = order
        self.root = BTreeNode(leaf=True)
 
    def search(self, key) -> Optional[Any]:
        """Search for a key and return its value, or None if not found."""
        return self._search_node(self.root, key)
 
    def _search_node(self, node: BTreeNode, key) -> Optional[Any]:
        """Recursively search for a key starting from node."""
        # Find the position where key should be
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
 
        # Check if we found the key
        if i < len(node.keys) and key == node.keys[i]:
            if node.leaf:
                return node.values[i]
            else:
                # In internal nodes, the key is just a separator
                # The actual value is in a leaf - go right
                return self._search_node(node.children[i + 1], key)
 
        # Key not found in this node
        if node.leaf:
            return None  # Key doesn't exist
        else:
            # Follow the appropriate child pointer
            return self._search_node(node.children[i], key)
 
    def insert(self, key, value):
        """Insert a key-value pair into the B-Tree."""
        root = self.root
 
        # If root is full, split it first
        if len(root.keys) == self.order - 1:
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
 
        self._insert_non_full(self.root, key, value)
 
    def _insert_non_full(self, node: BTreeNode, key, value):
        """Insert into a node that is guaranteed to have room."""
        i = len(node.keys) - 1
 
        if node.leaf:
            # Find position and insert
            while i >= 0 and key < node.keys[i]:
                i -= 1
            i += 1
            node.keys.insert(i, key)
            node.values.insert(i, value)
        else:
            # Find the child to descend into
            while i >= 0 and key < node.keys[i]:
                i -= 1
            i += 1
 
            # If child is full, split it first
            if len(node.children[i].keys) == self.order - 1:
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
 
            self._insert_non_full(node.children[i], key, value)
 
    def _split_child(self, parent: BTreeNode, child_index: int):
        """Split a full child node into two nodes."""
        order = self.order
        full_child = parent.children[child_index]
        mid = (order - 1) // 2
 
        # Create new node with right half of keys
        new_node = BTreeNode(leaf=full_child.leaf)
 
        # Move keys to new node
        new_node.keys = full_child.keys[mid + 1:]
        promoted_key = full_child.keys[mid]
        full_child.keys = full_child.keys[:mid]
 
        if full_child.leaf:
            # Move values too
            new_node.values = full_child.values[mid + 1:]
            full_child.values = full_child.values[:mid + 1]
            # Keep middle key in left node for leaves
            new_node.keys = full_child.keys[mid:]
            full_child.keys = full_child.keys[:mid]
            new_node.values = full_child.values[mid:]
            full_child.values = full_child.values[:mid]
        else:
            # Move children
            new_node.children = full_child.children[mid + 1:]
            full_child.children = full_child.children[:mid + 1]
 
        # Insert promoted key and new child into parent
        parent.keys.insert(child_index, promoted_key)
        parent.children.insert(child_index + 1, new_node)
 
    def range_query(self, start_key, end_key) -> List[Tuple[Any, Any]]:
        """Return all key-value pairs where start_key <= key <= end_key."""
        results = []
        self._range_query(self.root, start_key, end_key, results)
        return results
 
    def _range_query(self, node: BTreeNode, start_key, end_key, results: List):
        """Recursively collect keys in range."""
        if node.leaf:
            for i, key in enumerate(node.keys):
                if start_key <= key <= end_key:
                    results.append((key, node.values[i]))
        else:
            for i, key in enumerate(node.keys):
                if start_key <= key:
                    self._range_query(node.children[i], start_key, end_key, results)
                if start_key <= key <= end_key:
                    # Key itself might be in the range (for internal nodes that store values)
                    pass
            # Don't forget the rightmost child
            if end_key >= node.keys[-1] if node.keys else True:
                self._range_query(node.children[-1], start_key, end_key, results)
 
    def visualize(self):
        """Print a simple visualization of the tree."""
        self._visualize_node(self.root, 0)
 
    def _visualize_node(self, node: BTreeNode, level: int):
        indent = "  " * level
        node_type = "LEAF" if node.leaf else "INTERNAL"
        print(f"{indent}[{node_type}] keys={node.keys}")
        for child in node.children:
            self._visualize_node(child, level + 1)
 
 
# Usage example
if __name__ == "__main__":
    tree = BTree(order=4)  # Small order for demonstration
 
    # Insert some data
    data = [(10, "ten"), (20, "twenty"), (5, "five"), (15, "fifteen"),
            (25, "twenty-five"), (30, "thirty"), (35, "thirty-five")]
 
    for key, value in data:
        print(f"Inserting {key}...")
        tree.insert(key, value)
        tree.visualize()
        print()
 
    # Search
    print(f"Search for 15: {tree.search(15)}")
    print(f"Search for 99: {tree.search(99)}")
 
    # Range query
    print(f"Range 10-25: {tree.range_query(10, 25)}")

The Danger of In-Place Updates

B-Trees modify pages in place. This is fundamentally different from LSM-Trees' append-only approach, and it introduces challenges.

When you overwrite a page on disk, the hardware has to:

  • On HDD: Move the disk head to the right position, wait for the platter to spin to the right spot, then write
  • On SSD: Erase a large block (SSDs can't overwrite individual bytes), then rewrite

More critically, some operations require updating multiple pages atomically. When you split a page, you need to:

  1. Write the original page (now half-full)
  2. Write the new page (other half)
  3. Update the parent page with the new pointer

What if the database crashes after step 1 but before step 3? You have an orphan page that no one points to, a corrupted index.

Loading diagram...

Write-Ahead Logging (WAL) for Crash Safety

The solution is the Write-Ahead Log (WAL), also called a redo log. Before making any changes to B-Tree pages, you first append the intended changes to a log file. Only after the log is safely on disk do you modify the actual pages.

If the database crashes mid-operation, it can replay the WAL on startup to complete any partial operations or undo them, restoring the B-Tree to a consistent state.

Loading diagram...
import json
import os
from typing import List, Dict, Any
 
class WAL:
    """Write-Ahead Log for crash recovery."""
 
    def __init__(self, path: str):
        self.path = path
        self.log_file = open(path, 'a+')
 
    def log_operation(self, operation: Dict[str, Any]):
        """Log an operation before performing it."""
        entry = json.dumps(operation) + '\n'
        self.log_file.write(entry)
        self.log_file.flush()
        os.fsync(self.log_file.fileno())  # Ensure it's on disk
 
    def mark_complete(self, operation_id: str):
        """Mark an operation as successfully completed."""
        self.log_operation({'type': 'COMPLETE', 'id': operation_id})
 
    def get_incomplete_operations(self) -> List[Dict]:
        """Find operations that were logged but not marked complete."""
        self.log_file.seek(0)
        operations = {}
 
        for line in self.log_file:
            entry = json.loads(line.strip())
            if entry['type'] == 'COMPLETE':
                operations.pop(entry['id'], None)
            else:
                operations[entry['id']] = entry
 
        return list(operations.values())
 
    def clear(self):
        """Clear the log after successful checkpoint."""
        self.log_file.close()
        open(self.path, 'w').close()
        self.log_file = open(self.path, 'a+')
 
 
# Usage example
class DurableBTreePage:
    """A B-Tree page that uses WAL for durability."""
 
    def __init__(self, page_id: int, wal: WAL):
        self.page_id = page_id
        self.wal = wal
        self.data = {}
 
    def update(self, key, value):
        import uuid
        operation_id = str(uuid.uuid4())
 
        # Step 1: Log the intended change
        self.wal.log_operation({
            'id': operation_id,
            'type': 'UPDATE',
            'page_id': self.page_id,
            'key': key,
            'old_value': self.data.get(key),
            'new_value': value
        })
 
        # Step 2: Perform the change
        self.data[key] = value
 
        # Step 3: Mark complete
        self.wal.mark_complete(operation_id)

Concurrency Control with Latches

Another challenge with in-place updates: concurrent access. What if one thread is splitting a page while another is trying to read from it? The reader might see an inconsistent half-updated state.

B-Trees solve this with latches (lightweight locks). Before reading or modifying a page, a thread must acquire the appropriate latch. This ensures mutual exclusion but adds complexity.

LSM-Trees have an advantage here: since segments are immutable once written, readers can access them without locks. The only coordination needed is when swapping old segments for new ones, which can be done atomically.

Loading diagram...

B-Tree Optimizations

B-Trees have been around for 50+ years. Naturally, many optimizations have been developed:

Copy-on-Write (COW)

Instead of modifying pages in place (requiring WAL), write a new copy of any modified page. Then update parent pages to point to the new copy. The old pages remain untouched until all readers are done. LMDB uses this approach.

Loading diagram...

Key Abbreviation

Interior nodes only need keys to act as separators, they don't need the full key. By abbreviating keys, we can fit more of them in each page, increasing the branching factor and reducing tree depth.

def abbreviate_key(key: str, previous_key: str) -> str:
    """
    Create shortest prefix that distinguishes key from previous_key.
    Used in B-Tree internal nodes to save space.
    """
    # Find first differing character
    min_len = min(len(key), len(previous_key))
    for i in range(min_len):
        if key[i] != previous_key[i]:
            return key[:i + 1]
    return key[:min_len + 1]
 
 
# Example
print(abbreviate_key("application", "apple"))      # "appl"
print(abbreviate_key("banana", "apple"))           # "b"
print(abbreviate_key("application", "application")) # "application"

Sequential Leaf Pages

For efficient range scans, it helps if leaf pages are physically close together on disk. Many B-Tree implementations try to allocate nearby leaf pages in adjacent disk locations. However, as the tree grows and pages split, maintaining this layout becomes difficult. LSM-Trees have an advantage here: they rewrite entire segments, naturally keeping sorted data together.

Sibling Pointers

Adding pointers between leaf pages allows efficient range scans without backtracking to parent nodes. To scan from key A to key Z, find the leaf containing A, then follow sibling pointers until you pass Z.

Loading diagram...

B-Tree vs LSM-Tree Summary

Both structures keep data sorted and support range queries. The choice depends on your workload:

| Aspect | B-Tree | LSM-Tree | | ---------------------- | ------------------------ | --------------------------------- | | Write Pattern | In-place updates | Append-only | | Write Performance | Slower (random I/O) | Faster (sequential I/O) | | Read Performance | Faster (single location) | Slower (check multiple levels) | | Concurrency | Requires latches | Simpler (immutable segments) | | Crash Recovery | WAL required | WAL + immutable segments | | Space Amplification| Lower | Higher (during compaction) | | Maturity | 50+ years | ~20 years |

Loading diagram...

Complete Disk-Based B-Tree Example

Here's a more realistic implementation that simulates disk pages:

import struct
from typing import Optional, Dict, Any
import os
 
class DiskPage:
    """Simulates a fixed-size disk page."""
 
    PAGE_SIZE = 4096  # 4 KB
 
    def __init__(self, page_id: int, is_leaf: bool = True):
        self.page_id = page_id
        self.is_leaf = is_leaf
        self.keys: list = []
        self.values: list = []  # For leaf pages
        self.children: list = []  # For internal pages (page IDs)
        self.next_leaf: Optional[int] = None  # Sibling pointer
 
    def is_full(self, max_keys: int) -> bool:
        return len(self.keys) >= max_keys
 
 
class DiskBTree:
    """
    B-Tree with disk-page simulation.
    Demonstrates page-based storage and access patterns.
    """
 
    def __init__(self, directory: str, order: int = 100):
        self.directory = directory
        self.order = order  # Max children per page
        self.max_keys = order - 1
 
        os.makedirs(directory, exist_ok=True)
 
        self.page_cache: Dict[int, DiskPage] = {}
        self.next_page_id = 0
        self.root_id: Optional[int] = None
 
        # Statistics
        self.disk_reads = 0
        self.disk_writes = 0
 
        # Create root
        root = self._allocate_page(is_leaf=True)
        self.root_id = root.page_id
 
    def _allocate_page(self, is_leaf: bool = True) -> DiskPage:
        """Allocate a new page."""
        page = DiskPage(self.next_page_id, is_leaf)
        self.next_page_id += 1
        self.page_cache[page.page_id] = page
        return page
 
    def _read_page(self, page_id: int) -> DiskPage:
        """Read a page from 'disk' (simulated with cache)."""
        self.disk_reads += 1
        if page_id not in self.page_cache:
            raise ValueError(f"Page {page_id} not found")
        return self.page_cache[page_id]
 
    def _write_page(self, page: DiskPage):
        """Write a page to 'disk'."""
        self.disk_writes += 1
        self.page_cache[page.page_id] = page
 
    def search(self, key) -> Optional[Any]:
        """Search for a key, counting disk reads."""
        self.disk_reads = 0
        result = self._search(self.root_id, key)
        print(f"Search required {self.disk_reads} disk reads")
        return result
 
    def _search(self, page_id: int, key) -> Optional[Any]:
        page = self._read_page(page_id)
 
        # Binary search for the key position
        i = self._binary_search(page.keys, key)
 
        if i < len(page.keys) and page.keys[i] == key:
            if page.is_leaf:
                return page.values[i]
 
        if page.is_leaf:
            return None
 
        # Follow child pointer
        child_id = page.children[i]
        return self._search(child_id, key)
 
    def _binary_search(self, keys: list, target) -> int:
        """Find insertion point for target in sorted keys."""
        left, right = 0, len(keys)
        while left < right:
            mid = (left + right) // 2
            if keys[mid] < target:
                left = mid + 1
            else:
                right = mid
        return left
 
    def insert(self, key, value):
        """Insert a key-value pair."""
        root = self._read_page(self.root_id)
 
        if root.is_full(self.max_keys):
            # Root is full, need to split
            new_root = self._allocate_page(is_leaf=False)
            new_root.children.append(self.root_id)
            self._split_child(new_root, 0)
            self.root_id = new_root.page_id
            self._write_page(new_root)
 
        self._insert_non_full(self.root_id, key, value)
 
    def _split_child(self, parent: DiskPage, child_index: int):
        """Split a full child page."""
        child = self._read_page(parent.children[child_index])
        mid = self.max_keys // 2
 
        # Create new page for right half
        new_page = self._allocate_page(is_leaf=child.is_leaf)
 
        # Distribute keys
        new_page.keys = child.keys[mid + 1:]
        promoted_key = child.keys[mid]
        child.keys = child.keys[:mid]
 
        if child.is_leaf:
            new_page.values = child.values[mid + 1:]
            child.values = child.values[:mid + 1]
            new_page.next_leaf = child.next_leaf
            child.next_leaf = new_page.page_id
        else:
            new_page.children = child.children[mid + 1:]
            child.children = child.children[:mid + 1]
 
        # Update parent
        parent.keys.insert(child_index, promoted_key)
        parent.children.insert(child_index + 1, new_page.page_id)
 
        self._write_page(child)
        self._write_page(new_page)
        self._write_page(parent)
 
    def _insert_non_full(self, page_id: int, key, value):
        """Insert into a page that has room."""
        page = self._read_page(page_id)
        i = self._binary_search(page.keys, key)
 
        if page.is_leaf:
            page.keys.insert(i, key)
            page.values.insert(i, value)
            self._write_page(page)
        else:
            child_id = page.children[i]
            child = self._read_page(child_id)
 
            if child.is_full(self.max_keys):
                self._split_child(page, i)
                if key > page.keys[i]:
                    child_id = page.children[i + 1]
 
            self._insert_non_full(child_id, key, value)
 
    def stats(self):
        """Print statistics about the tree."""
        def count_levels(page_id: int) -> int:
            page = self._read_page(page_id)
            if page.is_leaf:
                return 1
            return 1 + count_levels(page.children[0])
 
        def count_pages(page_id: int) -> int:
            page = self._read_page(page_id)
            if page.is_leaf:
                return 1
            return 1 + sum(count_pages(c) for c in page.children)
 
        levels = count_levels(self.root_id)
        pages = count_pages(self.root_id)
 
        print(f"Tree depth: {levels} levels")
        print(f"Total pages: {pages}")
        print(f"Branching factor: up to {self.order}")
 
 
# Demonstration
if __name__ == "__main__":
    tree = DiskBTree("/tmp/btree_demo", order=50)
 
    # Insert a lot of data
    print("Inserting 10,000 keys...")
    for i in range(10000):
        tree.insert(i, f"value_{i}")
 
    tree.stats()
 
    # Search
    print("\nSearching for key 5000:")
    print(f"Found: {tree.search(5000)}")
 
    print("\nSearching for key 9999:")
    print(f"Found: {tree.search(9999)}")

Key Takeaways

| Concept | Description | | -------------------- | -------------------------------------------------------------------- | | Page | Fixed-size block (typically 4 KB) that forms the unit of disk I/O | | Branching Factor | Number of child pointers per page; higher = shallower tree | | O(log n) Depth | A tree with 500-way branching stores billions of keys in 4 levels | | In-Place Updates | Pages are overwritten, unlike LSM's append-only approach | | Page Splitting | When a page overflows, split it and update the parent | | WAL | Write-ahead log for crash recovery | | Latches | Lightweight locks for concurrent access |

B-Trees have earned their place as the dominant index structure for good reason. Their predictable performance, efficient reads, and mature implementations make them the safe default choice for most database workloads. Understanding both B-Trees and LSM-Trees gives you the knowledge to choose the right tool for your specific needs.