Nauman Munir

SSTables and LSM-Trees

Deep dive into Sorted String Tables (SSTables) and Log-Structured Merge-Trees (LSM-Trees) - the storage engine architecture powering LevelDB, RocksDB, Cassandra, and modern write-optimized databases.

16 min read
#SSTables#LSM-Trees#Database Internals#Storage Engines#LevelDB#RocksDB#Cassandra#Data Structures


We've seen how hash indexes with an append-only log can provide fast writes and reasonable reads. But that approach has one fundamental limitation: the in-memory hash map must hold every key. Once your keyspace exceeds RAM, you're in trouble.

There's a small change that can make a huge difference: sort your keys.

Sorted String Tables (SSTables)

Take our segment files from the hash index approach, but now require that keys are stored in sorted order. That's an SSTable, a Sorted String Table.


This simple requirement, sorted keys, unlocks several major advantages.


Advantage 1: Efficient Merging

When you need to compact multiple SSTables, you can use the classic merge sort algorithm. Start at the beginning of each file, compare keys, and always write the smallest one to the output.


Notice that carrot appears in both input files. When merging, you keep only the most recent value (from SSTable 2, since it's newer). This is how old values get garbage-collected during compaction.

from typing import List, Iterator, Tuple
import heapq
 
def merge_sstables(sstables: List[List[Tuple[str, str]]]) -> Iterator[Tuple[str, str]]:
    """
    Merge multiple sorted SSTables into one.
    Each SSTable is a list of (key, value) tuples, sorted by key.
    Newer SSTables appear later in the list and take precedence.
    """
    # Tag each entry with its SSTable index for precedence
    # Higher index = newer = higher priority
    tagged_iterators = [
        ((key, val, idx) for key, val in sstable)
        for idx, sstable in enumerate(sstables)
    ]
 
    # Merge using a heap
    merged = heapq.merge(*tagged_iterators, key=lambda x: x[0])
 
    current_key = None
    current_val = None
    current_priority = -1
 
    for key, val, priority in merged:
        if key != current_key:
            # New key - emit previous if exists
            if current_key is not None:
                yield current_key, current_val
            current_key = key
            current_val = val
            current_priority = priority
        elif priority > current_priority:
            # Same key but newer SSTable - update value
            current_val = val
            current_priority = priority
 
    # Don't forget the last key
    if current_key is not None:
        yield current_key, current_val
 
 
# Example usage
sstable_1 = [("apple", "v1"), ("carrot", "v2"), ("mango", "v3")]
sstable_2 = [("banana", "v4"), ("carrot", "v5"), ("zebra", "v6")]
 
merged = list(merge_sstables([sstable_1, sstable_2]))
print(merged)
# Output: [('apple', 'v1'), ('banana', 'v4'), ('carrot', 'v5'), ('mango', 'v3'), ('zebra', 'v6')]

This merge is incredibly efficient:

  • O(n log k) time, where n is the total number of entries and k the number of input files (effectively linear for a handful of segments)
  • Streams through files sequentially, perfect for disk I/O
  • Memory needed is just the current entries being compared

Advantage 2: Sparse Index

Here's where it gets clever. With sorted keys, you don't need to index every single key in memory.

Imagine your SSTable has a million keys. With hash indexing, you'd need a million entries in your hash map. But with sorted keys, you can keep a sparse index, just a few keys with their file offsets.


To find helicopter:

  1. Look at sparse index: it's between handbag (offset 2048) and mango (offset 5120)
  2. Seek to offset 2048
  3. Scan forward until you find helicopter

from typing import Optional, List, Tuple
import bisect
 
class SparseIndex:
    def __init__(self):
        self.keys: List[str] = []
        self.offsets: List[int] = []
 
    def add(self, key: str, offset: int):
        self.keys.append(key)
        self.offsets.append(offset)
 
    def find_range(self, target: str) -> Tuple[int, Optional[int]]:
        """
        Return (start_offset, end_offset) for where target might be.
        end_offset is None if target is past the last indexed key.
        """
        # bisect_right so that a target equal to an indexed key falls into
        # the block that starts at that key (bisect_left would skip it)
        idx = bisect.bisect_right(self.keys, target)
 
        if idx == 0:
            # Target is before the first indexed key
            return 0, self.offsets[0] if self.keys else None
 
        if idx == len(self.keys):
            # Target is at or after the last indexed key
            return self.offsets[-1], None
 
        # Target falls between two indexed keys
        return self.offsets[idx - 1], self.offsets[idx]
 
 
# Example usage
index = SparseIndex()
index.add("apple", 0)
index.add("handbag", 2048)
index.add("mango", 5120)
 
print(index.find_range("helicopter"))  # (2048, 5120)
print(index.find_range("zebra"))       # (5120, None)

You can choose how sparse to make it. One indexed key per 4KB of file data is a reasonable starting point.


Advantage 3: Block Compression

Since you're already doing a small scan within a block, you can group records into compressed blocks. Each sparse index entry points to the start of a compressed block.


This saves disk space and reduces I/O bandwidth. A single compressed block is read into memory, decompressed, and scanned.
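To make the mechanics concrete, here is a minimal sketch of the idea, not the file format any particular engine uses: sorted records are grouped into zlib-compressed, length-prefixed blocks, and the sparse index records the first key and file offset of each block. It assumes keys and values are NUL-free UTF-8 strings.

import struct
import zlib
from typing import List, Tuple
 
def write_compressed_blocks(records: List[Tuple[str, str]], path: str,
                            block_size: int = 4096) -> List[Tuple[str, int]]:
    """Write sorted records as compressed blocks; return a sparse index
    of (first_key_in_block, file_offset_of_block) entries."""
    sparse_index: List[Tuple[str, int]] = []
    with open(path, 'wb') as f:
        block: List[Tuple[str, str]] = []
        block_bytes = 0
 
        def flush_block():
            nonlocal block, block_bytes
            if not block:
                return
            payload = b''.join(f"{k}\x00{v}\x00".encode('utf-8') for k, v in block)
            compressed = zlib.compress(payload)
            sparse_index.append((block[0][0], f.tell()))
            f.write(struct.pack('>I', len(compressed)))  # length prefix
            f.write(compressed)
            block, block_bytes = [], 0
 
        for key, value in records:
            block.append((key, value))
            block_bytes += len(key) + len(value)
            if block_bytes >= block_size:
                flush_block()
        flush_block()
    return sparse_index
 
def read_block(path: str, offset: int) -> List[Tuple[str, str]]:
    """Read back one compressed block and decode its records."""
    with open(path, 'rb') as f:
        f.seek(offset)
        length = struct.unpack('>I', f.read(4))[0]
        payload = zlib.decompress(f.read(length))
    parts = payload.decode('utf-8').split('\x00')
    return list(zip(parts[0::2], parts[1::2]))

A lookup then works as before: the sparse index points at a block, the whole block is decompressed into memory, and the scan runs over the decoded records.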


But Wait, How Do You Sort Incoming Writes?

Here's the tricky part. Writes come in any order: first zebra, then apple, then mango. How do you produce a sorted file on disk?

You can't sort a file efficiently after it's written. Instead, use an in-memory sorted structure:

  1. Writes go into a balanced tree in memory (red-black tree, AVL tree, or skip list)
  2. When this tree exceeds a size threshold, write it out to disk as a new SSTable
  3. The tree is already sorted, so writing is just a sequential dump

This in-memory tree is called a memtable.
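As a rough sketch of just the memtable idea (using sortedcontainers.SortedDict as the sorted structure, the same library the fuller implementation below relies on, and a throwaway text format for the flush):

from sortedcontainers import SortedDict  # pip install sortedcontainers
 
class Memtable:
    """Accepts writes in any order but keeps them sorted in memory."""
 
    def __init__(self, threshold: int = 1000):
        self.entries = SortedDict()
        self.threshold = threshold
 
    def put(self, key: str, value: str) -> bool:
        """Insert or overwrite a key. Returns True when it's time to flush."""
        self.entries[key] = value
        return len(self.entries) >= self.threshold
 
    def flush_to_sstable(self, path: str):
        """Dump all entries to disk in sorted order (plain 'key<TAB>value'
        lines here for illustration; real engines use a binary format)."""
        with open(path, 'w', encoding='utf-8') as f:
            for key, value in self.entries.items():
                f.write(f"{key}\t{value}\n")
        self.entries.clear()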


The Full Picture: LSM-Trees

Now we can see the complete architecture, known as a Log-Structured Merge-Tree (LSM-Tree):


Write path:

  1. Write to memtable (fast, in-memory)
  2. When memtable exceeds threshold, write as new SSTable
  3. Periodically merge SSTables in background (compaction)

Read path:

  1. Check memtable first
  2. Check SSTables from newest to oldest
  3. Return first value found (it's the most recent)

Crash recovery: The memtable lives in RAM, so what happens if we crash? We keep a separate append-only log on disk (a write-ahead log, or WAL). Every write goes to the log first. On recovery, we replay the log to rebuild the memtable.


A Complete LSM-Tree Implementation

Let's put it all together:

import os
import struct
from typing import Optional, Dict, List, Tuple
from sortedcontainers import SortedDict  # pip install sortedcontainers
 
 
class SSTable:
    """
    A read-only sorted file of key-value pairs.
    Format: [key_len (4 bytes)][key][value_len (4 bytes)][value]...
    """
 
    def __init__(self, filepath: str):
        self.filepath = filepath
        self.sparse_index: List[Tuple[str, int]] = []
        self._build_sparse_index()
 
    def _build_sparse_index(self, block_size: int = 4096):
        """Build a sparse index with one entry per block_size bytes."""
        with open(self.filepath, 'rb') as f:
            offset = 0
            last_indexed = -block_size
 
            while True:
                pos = f.tell()
                data = self._read_record(f)
                if data is None:
                    break
 
                key, _ = data
 
                if pos - last_indexed >= block_size:
                    self.sparse_index.append((key, pos))
                    last_indexed = pos
 
    def _read_record(self, f) -> Optional[Tuple[str, str]]:
        """Read a single key-value record."""
        key_len_bytes = f.read(4)
        if not key_len_bytes:
            return None
        key_len = struct.unpack('>I', key_len_bytes)[0]
        key = f.read(key_len).decode('utf-8')
        val_len = struct.unpack('>I', f.read(4))[0]
        value = f.read(val_len).decode('utf-8')
        return key, value
 
    def get(self, target_key: str) -> Optional[str]:
        """Look up a key using sparse index + scan."""
        if not self.sparse_index:
            return None
 
        # Walk the sparse index to find the last indexed key <= target
        # (a binary search over the index would also work; a linear scan
        # keeps the example simple)
        start_offset = 0
        for key, offset in self.sparse_index:
            if key > target_key:
                break
            start_offset = offset
 
        # Scan from that offset
        with open(self.filepath, 'rb') as f:
            f.seek(start_offset)
            while True:
                data = self._read_record(f)
                if data is None:
                    break
                key, value = data
                if key == target_key:
                    return value
                if key > target_key:
                    break  # Passed it, not found
 
        return None
 
    def __iter__(self):
        """Iterate through all key-value pairs in sorted order."""
        with open(self.filepath, 'rb') as f:
            while True:
                data = self._read_record(f)
                if data is None:
                    break
                yield data
 
 
class LSMTree:
    """
    A Log-Structured Merge-Tree storage engine.
 
    Features:
    - In-memory memtable (sorted)
    - On-disk SSTables (sorted, immutable)
    - Write-ahead log for durability
    - Background compaction
    """
 
    def __init__(self, directory: str, memtable_threshold: int = 1000):
        self.directory = directory
        self.memtable_threshold = memtable_threshold
 
        os.makedirs(directory, exist_ok=True)
 
        # In-memory sorted structure
        self.memtable: SortedDict = SortedDict()
 
        # Write-ahead log for crash recovery
        self.wal_path = os.path.join(directory, 'wal.log')
        self.wal = open(self.wal_path, 'ab')
 
        # On-disk SSTables (newest first)
        self.sstables: List[SSTable] = []
        self.sstable_counter = 0
 
        self._recover()
 
    def _recover(self):
        """Recover from WAL and load existing SSTables."""
        # Load existing SSTables, newest first (self.sstables is kept in
        # newest-to-oldest order so reads see the most recent values)
        sstable_files = sorted([
            f for f in os.listdir(self.directory)
            if f.startswith('sstable_') and f.endswith('.db')
        ], reverse=True)
 
        for filename in sstable_files:
            self.sstables.append(SSTable(os.path.join(self.directory, filename)))
            self.sstable_counter = max(
                self.sstable_counter,
                int(filename.split('_')[1].split('.')[0]) + 1
            )
 
        # Recover memtable from WAL
        try:
            with open(self.wal_path, 'rb') as f:
                while True:
                    data = self._read_wal_entry(f)
                    if data is None:
                        break
                    key, value = data
                    self.memtable[key] = value
        except FileNotFoundError:
            pass
 
    def _read_wal_entry(self, f) -> Optional[Tuple[str, str]]:
        """Read a WAL entry."""
        key_len_bytes = f.read(4)
        if not key_len_bytes:
            return None
        key_len = struct.unpack('>I', key_len_bytes)[0]
        key = f.read(key_len).decode('utf-8')
        val_len = struct.unpack('>I', f.read(4))[0]
        value = f.read(val_len).decode('utf-8')
        return key, value
 
    def _write_wal_entry(self, key: str, value: str):
        """Append entry to WAL."""
        key_bytes = key.encode('utf-8')
        val_bytes = value.encode('utf-8')
        self.wal.write(struct.pack('>I', len(key_bytes)))
        self.wal.write(key_bytes)
        self.wal.write(struct.pack('>I', len(val_bytes)))
        self.wal.write(val_bytes)
        self.wal.flush()
 
    def set(self, key: str, value: str):
        """Write a key-value pair."""
        # Write to WAL first (durability)
        self._write_wal_entry(key, value)
 
        # Then to memtable
        self.memtable[key] = value
 
        # Flush if memtable is too large
        if len(self.memtable) >= self.memtable_threshold:
            self._flush_memtable()
 
    def _flush_memtable(self):
        """Write memtable to disk as a new SSTable."""
        if not self.memtable:
            return
 
        # Write sorted data to new SSTable
        sstable_path = os.path.join(
            self.directory, f'sstable_{self.sstable_counter:06d}.db'
        )
 
        with open(sstable_path, 'wb') as f:
            for key, value in self.memtable.items():
                key_bytes = key.encode('utf-8')
                val_bytes = value.encode('utf-8')
                f.write(struct.pack('>I', len(key_bytes)))
                f.write(key_bytes)
                f.write(struct.pack('>I', len(val_bytes)))
                f.write(val_bytes)
 
        # Add to SSTable list (newest first)
        self.sstables.insert(0, SSTable(sstable_path))
        self.sstable_counter += 1
 
        # Clear memtable and WAL
        self.memtable.clear()
        self.wal.close()
        os.remove(self.wal_path)
        self.wal = open(self.wal_path, 'ab')
 
        print(f"Flushed memtable to {sstable_path}")
 
    def get(self, key: str) -> Optional[str]:
        """Read a value by key."""
        # Check memtable first (most recent)
        if key in self.memtable:
            return self.memtable[key]
 
        # Check SSTables from newest to oldest
        for sstable in self.sstables:
            value = sstable.get(key)
            if value is not None:
                return value
 
        return None
 
    def compact(self):
        """Merge all SSTables into one."""
        if len(self.sstables) < 2:
            return
 
        print(f"Compacting {len(self.sstables)} SSTables...")
 
        # Merge all SSTables
        merged_path = os.path.join(
            self.directory, f'sstable_{self.sstable_counter:06d}.db'
        )
 
        # Use a dict to deduplicate (newer values overwrite older)
        merged_data: Dict[str, str] = {}
        for sstable in reversed(self.sstables):  # Oldest first
            for key, value in sstable:
                merged_data[key] = value  # Newer overwrites
 
        # Write merged SSTable
        with open(merged_path, 'wb') as f:
            for key in sorted(merged_data.keys()):
                value = merged_data[key]
                key_bytes = key.encode('utf-8')
                val_bytes = value.encode('utf-8')
                f.write(struct.pack('>I', len(key_bytes)))
                f.write(key_bytes)
                f.write(struct.pack('>I', len(val_bytes)))
                f.write(val_bytes)
 
        # Delete old SSTables
        old_paths = [st.filepath for st in self.sstables]
 
        # Replace with merged
        self.sstables = [SSTable(merged_path)]
        self.sstable_counter += 1
 
        for path in old_paths:
            os.remove(path)
 
        print(f"Compacted into {merged_path}")
 
    def close(self):
        """Clean shutdown."""
        self._flush_memtable()
        self.wal.close()
 
 
# Usage example
if __name__ == "__main__":
    db = LSMTree("/tmp/lsmtree_demo", memtable_threshold=5)
 
    # Write some data
    for i in range(12):
        db.set(f"key_{i:03d}", f"value_{i}")
 
    # Read it back
    print(db.get("key_005"))  # Should find it
    print(db.get("key_999"))  # Should return None
 
    # Trigger compaction
    db.compact()
 
    # Verify data after compaction
    print(db.get("key_005"))  # Should still work
 
    db.close()

Who Uses LSM-Trees?

This isn't just academic. LSM-Trees power some of the most widely used databases:

| Database | Usage |
| -------- | ----- |
| LevelDB | Google's embedded key-value store |
| RocksDB | Facebook's fork of LevelDB, used everywhere |
| Cassandra | Distributed wide-column store |
| HBase | Hadoop's database (inspired by BigTable) |
| Lucene/Elasticsearch | Full-text search (term dictionary uses SSTables) |


The term "Log-Structured Merge-Tree" comes from a 1996 paper by Patrick O'Neil. The "log-structured" part refers to the append-only writing pattern. The "merge-tree" part refers to the background merging of sorted segments.


Performance Optimization: Bloom Filters

There's one scenario where LSM-Trees struggle: looking up keys that don't exist.

If a key isn't in the database, you have to check the memtable (fast), then every single SSTable on disk (slow), only to find nothing. With many SSTables, this means many disk reads for a negative result.

The solution is a Bloom filter, a probabilistic data structure that can tell you "this key is definitely NOT in this SSTable" without reading the SSTable. It might have false positives (saying "maybe yes" when the answer is no), but never false negatives. If the Bloom filter says no, you can skip that SSTable entirely.

from typing import Optional
from bitarray import bitarray  # pip install bitarray
import mmh3  # pip install mmh3
 
class BloomFilter:
    """
    A space-efficient probabilistic data structure.
    Can definitively say "no" but only "probably yes".
    """
 
    def __init__(self, size: int = 10000, num_hashes: int = 7):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
 
    def _hashes(self, item: str):
        """Generate multiple hash values for an item."""
        for i in range(self.num_hashes):
            yield mmh3.hash(item, i) % self.size
 
    def add(self, item: str):
        """Add an item to the filter."""
        for pos in self._hashes(item):
            self.bit_array[pos] = 1
 
    def might_contain(self, item: str) -> bool:
        """
        Check if item might be in the set.
        False = definitely not in set
        True = probably in set (might be false positive)
        """
        return all(self.bit_array[pos] for pos in self._hashes(item))
 
 
# Usage with SSTables
class SSTableWithBloom(SSTable):
    def __init__(self, filepath: str):
        super().__init__(filepath)
        self.bloom = BloomFilter()
        self._build_bloom_filter()
 
    def _build_bloom_filter(self):
        for key, _ in self:
            self.bloom.add(key)
 
    def get(self, target_key: str) -> Optional[str]:
        # Fast path: Bloom filter says definitely not here
        if not self.bloom.might_contain(target_key):
            return None
 
        # Slow path: might be here, do actual lookup
        return super().get(target_key)

Compaction Strategies

How do you decide when to merge SSTables and which ones to merge? There are two main strategies:

Size-Tiered Compaction

Merge SSTables of similar sizes together. Newer, smaller SSTables accumulate until you have enough to merge into a larger one. Simple, but can use a lot of disk space temporarily (you need space for old and new SSTables during merge).
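A hedged sketch of the selection logic (the bucketing rule and thresholds here are made up for illustration; real engines such as Cassandra expose their own tuning knobs): group SSTables of similar size into buckets and merge a bucket once it has enough members.

from typing import Dict, List
 
def pick_size_tiered_candidates(sstable_sizes: Dict[str, int],
                                min_merge: int = 4,
                                bucket_factor: float = 2.0) -> List[str]:
    """Group SSTables whose sizes are within bucket_factor of the bucket
    average; return the first bucket with at least min_merge members."""
    buckets: List[List[str]] = []
    bucket_avg: List[float] = []
 
    for name, size in sorted(sstable_sizes.items(), key=lambda kv: kv[1]):
        for i, avg in enumerate(bucket_avg):
            if avg / bucket_factor <= size <= avg * bucket_factor:
                buckets[i].append(name)
                bucket_avg[i] = (avg * (len(buckets[i]) - 1) + size) / len(buckets[i])
                break
        else:
            buckets.append([name])
            bucket_avg.append(float(size))
 
    for group in buckets:
        if len(group) >= min_merge:
            return group
    return []  # nothing worth merging yet
 
# Four small SSTables of similar size end up in one bucket and get merged;
# the one large SSTable is left alone.
sizes = {"sst_1": 10, "sst_2": 12, "sst_3": 11, "sst_4": 9, "sst_5": 500}
print(pick_size_tiered_candidates(sizes))  # ['sst_4', 'sst_1', 'sst_3', 'sst_2']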


Leveled Compaction

Organize SSTables into levels. Each level has a size limit. When a level overflows, merge some of its SSTables into the next level. This uses less disk space and provides more predictable read performance, but compaction is more complex.
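And a similarly hedged sketch of the leveling rule (the 10x growth factor mirrors common defaults, but everything else is simplified, and level 0 is special-cased in real engines because its tables may overlap): each level has a size budget, and the first level over budget has one of its SSTables merged down into the next level.

from typing import List
 
def level_over_budget(level_sizes_bytes: List[int],
                      l1_budget: int = 10 * 1024 * 1024,
                      growth: int = 10) -> int:
    """Return the index of the first level exceeding its size budget,
    or -1 if no compaction is needed. Budgets grow by `growth` per level."""
    for level in range(1, len(level_sizes_bytes)):
        budget = l1_budget * (growth ** (level - 1))
        if level_sizes_bytes[level] > budget:
            return level
    return -1
 
# Level 1 gets ~10 MB, level 2 ~100 MB, level 3 ~1 GB, and so on.
# Here level 2 holds 120 MB against a ~100 MB budget, so it should
# push an SSTable down into level 3 via a merge.
sizes = [3_000_000, 8_000_000, 120_000_000, 200_000_000]
print(level_over_budget(sizes))  # 2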


LevelDB uses leveled compaction (hence the name), and RocksDB defaults to it while also offering other strategies. Cassandra supports both size-tiered and leveled compaction.


Why LSM-Trees Are Great for Writes

Let's summarize why LSM-Trees excel at write-heavy workloads:

  1. Sequential writes: All disk writes are sequential appends. No random seeks. This is the fastest possible pattern for both HDDs and SSDs.

  2. High throughput: Writes land in the memtable and reach disk in large sorted batches. The only per-write disk work is a sequential append to the WAL, not a random write for every single update.

  3. Sorted data enables range queries: Unlike hash indexes, you can efficiently scan all keys in a range (see the sketch after this list).

  4. Handles data larger than memory: The sparse index trick means you only need a fraction of keys in RAM.
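
Here is a minimal sketch of such a range scan, reusing the SSTable class from the implementation above. It walks the file from the start for simplicity; a real engine would first seek to the right block via the sparse index and would also merge in results from the memtable and other SSTables.

from typing import Iterator, Tuple
 
def range_scan(sstable: "SSTable", lo: str, hi: str) -> Iterator[Tuple[str, str]]:
    """Yield all (key, value) pairs with lo <= key < hi from one SSTable."""
    for key, value in sstable:          # keys come back in sorted order
        if key >= hi:
            break                       # sorted order lets us stop early
        if key >= lo:
            yield key, value
 
# Hypothetical usage against a flushed SSTable file:
# table = SSTable("/tmp/lsmtree_demo/sstable_000000.db")
# for key, value in range_scan(table, "key_003", "key_007"):
#     print(key, value)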


Key Takeaways

| Concept | Description |
| ------- | ----------- |
| SSTable | Sorted String Table: a segment file with keys in sorted order |
| Memtable | In-memory balanced tree that accepts writes in any order |
| WAL | Write-ahead log for crash recovery of the memtable |
| Sparse Index | Only index some keys; rely on sorting to find the others |
| Bloom Filter | Quickly reject lookups for non-existent keys |
| Compaction | Background merging of SSTables to reclaim space |

The LSM-Tree is one of the most important data structures in modern databases. Its combination of sequential writes, sorted storage, and background compaction provides an excellent balance of write performance and read capability, especially for workloads with more writes than reads.

In the next section, we'll explore B-Trees, the traditional alternative that takes a completely different approach to the same problems.