Hash Indexes
Learn how hash indexes work in databases by combining in-memory hash maps with append-only logs. Discover the Bitcask storage engine, compaction, and when hash indexes are the perfect choice.
Starting with What You Already Know
If you've written code in Python, JavaScript, Java, or almost any modern programming language, you've used a hash map (also called a dictionary or hash table). It's that data structure where you store key-value pairs and can look up any value by its key in constant time, O(1). You say "give me the value for key X" and boom, you have it instantly.
Here's the insight that makes hash indexes work: why not use the same idea to index data on disk?
We already have our append-only log file from the previous section. Every time we write a key-value pair, we append it to the file. The problem was that reading required scanning the entire file. But what if we kept a hash map in memory that tells us exactly where each key's value lives in the file?
How It Works
The concept is beautifully simple. Your hash map doesn't store the actual values; that would defeat the purpose, since values can be huge. Instead, it stores the byte offset in the file where each key's value begins. When you need to read a value:
- Look up the key in the hash map → get the byte offset
- Seek to that position in the file
- Read the value
One disk seek, and you have your data. If that part of the file is already in the operating system's cache, the read might not require any disk I/O at all.
class HashIndexedDatabase:
    def __init__(self, filename):
        self.filename = filename
        self.index = {}  # key -> byte offset
        self._rebuild_index()

    def _rebuild_index(self):
        """Scan file once at startup to build the index."""
        try:
            with open(self.filename, 'rb') as f:
                while True:
                    offset = f.tell()
                    # Read key length (4 bytes), then key, then value length, then value
                    key_len_bytes = f.read(4)
                    if not key_len_bytes:
                        break
                    key_len = int.from_bytes(key_len_bytes, 'big')
                    key = f.read(key_len).decode('utf-8')
                    val_len = int.from_bytes(f.read(4), 'big')
                    f.read(val_len)  # Skip the value, we just need the offset
                    self.index[key] = offset
        except FileNotFoundError:
            pass

    def set(self, key, value):
        """Append to file and update hash index."""
        with open(self.filename, 'ab') as f:
            offset = f.tell()
            key_bytes = key.encode('utf-8')
            val_bytes = value.encode('utf-8')
            # Write: key_length + key + value_length + value
            f.write(len(key_bytes).to_bytes(4, 'big'))
            f.write(key_bytes)
            f.write(len(val_bytes).to_bytes(4, 'big'))
            f.write(val_bytes)
            self.index[key] = offset

    def get(self, key):
        """O(1) lookup using hash index!"""
        if key not in self.index:
            return None
        with open(self.filename, 'rb') as f:
            f.seek(self.index[key])
            key_len = int.from_bytes(f.read(4), 'big')
            f.read(key_len)  # Skip the key
            val_len = int.from_bytes(f.read(4), 'big')
            return f.read(val_len).decode('utf-8')

# Usage
db = HashIndexedDatabase("videos.db")
db.set("cat_video_1", '{"views": 1000000, "likes": 50000}')
db.set("cat_video_2", '{"views": 500000, "likes": 25000}')
print(db.get("cat_video_1"))  # Instant lookup!

This Is a Real Production System
This might sound too simple to be real, but it's essentially how Bitcask, the default storage engine of the distributed database Riak, works. Bitcask offers very high performance for both reads and writes, with one important constraint: all of your keys must fit in RAM.
Think about that trade-off. The keys live in memory, but the values live on disk. If you have a million keys of 100 bytes each, that's only about 100 MB of RAM for the index (plus some hash-map overhead). Your values, meanwhile, could add up to gigabytes; they stay on disk and are loaded only when needed.
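A quick back-of-the-envelope calculation makes the asymmetry concrete (the per-entry overhead below is an illustrative assumption, not a measured Bitcask figure):

# Rough sizing sketch: keys (plus index bookkeeping) live in RAM, values stay on disk.
num_keys = 1_000_000
avg_key_size = 100      # bytes per key (assumption)
index_overhead = 50     # hash-map bookkeeping per entry (assumption)

index_ram_mb = num_keys * (avg_key_size + index_overhead) / 1024 / 1024
print(f"In-memory index: ~{index_ram_mb:.0f} MB")  # ~143 MB

# The values never enter the index at all: whether they total
# 10 GB or 10 TB, the RAM cost of the hash index stays the same.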
The Perfect Use Case: Frequent Updates to Existing Keys
Hash indexes shine when you have lots of writes, but not too many distinct keys. The classic example is tracking video view counts:
- Key: Video URL (there are millions, but they fit in RAM)
- Value: Play count and statistics
- Pattern: Every time someone watches, increment the counter
You're constantly updating values, but the set of keys grows slowly. This is perfect for a hash-indexed append-only log.
# Simulating the video view count use case
import json

class VideoTracker:
    def __init__(self):
        self.db = HashIndexedDatabase("video_stats.db")

    def record_view(self, video_url):
        """Called every time someone watches a video."""
        # Get current stats (or default)
        stats_json = self.db.get(video_url)
        if stats_json:
            stats = json.loads(stats_json)
            stats['views'] += 1
        else:
            stats = {'views': 1, 'first_viewed': '2024-01-15'}
        # Write updated stats (appends to log, updates index)
        self.db.set(video_url, json.dumps(stats))

    def get_views(self, video_url):
        stats_json = self.db.get(video_url)
        return json.loads(stats_json)['views'] if stats_json else 0

tracker = VideoTracker()
# Millions of view events per day
tracker.record_view("youtube.com/watch?v=cat123")
tracker.record_view("youtube.com/watch?v=cat123")
tracker.record_view("youtube.com/watch?v=dog456")
tracker.record_view("youtube.com/watch?v=cat123")
print(tracker.get_views("youtube.com/watch?v=cat123"))  # 3

The Disk Space Problem: Compaction
Wait, if we only append and never overwrite, won't we run out of disk space? Every update creates a new entry, and the old ones just sit there taking up space.
The solution is compaction. Here's how it works:
- Break the log into segments (files of a fixed maximum size)
- When a segment is full, close it and start a new one
- In the background, compact old segments by throwing away outdated entries
- Keep only the most recent value for each key
Let's trace through a concrete example with video play counts:
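Suppose one segment has accumulated the play-count writes below (the keys and counts are invented for illustration). Compaction keeps only the most recent value for each key, which is exactly what a dictionary gives you:

# Writes appended to one segment, oldest first
segment = [
    ("cat123", 1), ("cat123", 2), ("dog456", 1),
    ("cat123", 3), ("dog456", 2), ("cat123", 4), ("cat123", 5),
]

# Compaction: later entries overwrite earlier ones
compacted = {}
for key, count in segment:
    compacted[key] = count

print(compacted)  # {'cat123': 5, 'dog456': 2} -- seven records shrink to two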
Merging Segments
Compaction typically shrinks segments significantly, because each key is overwritten many times. So we can go further and merge several compacted segments into one.
The beautiful thing is that merging happens in a background thread. While the merge is running, the database keeps serving reads and writes using the old segments. Only when the merge is complete do we atomically switch to the new merged segment and delete the old ones.
import os

class SegmentedHashDatabase:
    """Simplified demonstration of segmented storage with compaction."""

    def __init__(self, base_path, max_segment_size=1024):
        self.base_path = base_path
        self.max_segment_size = max_segment_size
        self.segments = []  # List of (filename, hash_index) tuples, oldest first
        self.current_segment = None
        self.current_index = {}
        self._init_new_segment()

    def _init_new_segment(self):
        """Freeze the current segment (if any) and start a new segment file."""
        os.makedirs(self.base_path, exist_ok=True)
        if self.current_segment:
            # Freeze the old segment: it will never be written again
            self.segments.append((self.current_segment, self.current_index))
        segment_id = len(self.segments)
        filename = f"{self.base_path}/segment_{segment_id}.db"
        self.current_segment = filename
        self.current_index = {}
        open(filename, 'wb').close()  # Create empty file

    def set(self, key, value):
        """Write to current segment, roll over if too large."""
        # Check if we need a new segment
        if os.path.exists(self.current_segment):
            if os.path.getsize(self.current_segment) > self.max_segment_size:
                self._init_new_segment()
        # Append to current segment (simple "key:value" text format;
        # assumes keys contain no ':')
        with open(self.current_segment, 'ab') as f:
            offset = f.tell()
            f.write(f"{key}:{value}\n".encode())
            self.current_index[key] = offset

    def get(self, key):
        """Check current segment first, then older segments."""
        # Check current segment
        if key in self.current_index:
            return self._read_from_segment(self.current_segment, self.current_index[key])
        # Check older segments (newest first)
        for filename, index in reversed(self.segments):
            if key in index:
                return self._read_from_segment(filename, index[key])
        return None

    def _read_from_segment(self, filename, offset):
        with open(filename, 'rb') as f:
            f.seek(offset)
            line = f.readline().decode().strip()
            return line.split(':', 1)[1]

    def compact_and_merge(self):
        """Merge all frozen segments into one compacted segment."""
        if not self.segments:
            return
        # Collect all key-value pairs, keeping only the latest
        # (segments are oldest first, so later segments win)
        merged = {}
        for filename, index in self.segments:
            for key in index:
                merged[key] = self._read_from_segment(filename, index[key])
        # Write to new merged segment
        new_filename = f"{self.base_path}/merged_{len(self.segments)}.db"
        new_index = {}
        with open(new_filename, 'wb') as f:
            for key, value in merged.items():
                offset = f.tell()
                f.write(f"{key}:{value}\n".encode())
                new_index[key] = offset
        # Atomic switch (in a real implementation, this needs careful handling)
        old_files = [f for f, _ in self.segments]
        self.segments = [(new_filename, new_index)]
        # Delete old segment files
        for old_file in old_files:
            os.remove(old_file)
        print(f"Merged {len(old_files)} segments into 1")

Real-World Implementation Details
Making this simple idea work in production requires handling several tricky details:
Binary Format, Not CSV
CSV is human-readable but slow to parse and wastes space. Real implementations use binary formats: store the length of each string first, then the raw bytes.
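As a rough sketch of what a length-prefixed binary record looks like (this mirrors the format used in the examples above; real engines add more fields, such as checksums and timestamps):

import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    """Length-prefixed record: key_len(4) + key + value_len(4) + value."""
    return (struct.pack('>I', len(key)) + key +
            struct.pack('>I', len(value)) + value)

record = encode_record(b"cat_video_1", b'{"views": 1000000}')
print(len(record))  # no delimiters, no quoting, no escaping -- just lengths and raw bytes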
Deleting Records: Tombstones
How do you delete a key from an append-only log? You can't erase it. Instead, you append a special marker called a tombstone. When compaction runs, it sees the tombstone and knows to discard all previous values for that key.
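A minimal sketch of how a tombstone interacts with compaction (the sentinel value is an arbitrary assumption; real formats typically mark deletion with a flag in the record header):

TOMBSTONE = b"\x00__deleted__\x00"  # assumed sentinel meaning "this key was deleted"

def compact(records):
    """Keep the latest value per key, then drop keys whose latest value is a tombstone."""
    latest = {}
    for key, value in records:  # records in write order, oldest first
        latest[key] = value
    return {k: v for k, v in latest.items() if v != TOMBSTONE}

log = [("user:1", b"Alice"), ("user:2", b"Bob"), ("user:1", TOMBSTONE)]
print(compact(log))  # {'user:2': b'Bob'} -- user:1 disappears entirely after compaction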
Crash Recovery
When the database restarts, those in-memory hash maps are gone. You could rebuild them by scanning all segment files, but that's slow for large files. Bitcask solves this by periodically saving snapshots of each segment's hash map to disk. On startup, load the snapshots instead of scanning.
import pickle

def save_index_snapshot(index, snapshot_path):
    """Save hash index to disk for fast recovery."""
    with open(snapshot_path, 'wb') as f:
        pickle.dump(index, f)

def load_index_snapshot(snapshot_path):
    """Load hash index from disk snapshot."""
    try:
        with open(snapshot_path, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None

# Usage
index = {'key1': 0, 'key2': 100, 'key3': 200}
save_index_snapshot(index, 'segment_1.snapshot')

# After a crash, recover quickly
recovered_index = load_index_snapshot('segment_1.snapshot')

Handling Corrupted Writes
What if the database crashes mid-write? You might have half a record at the end of a file. Solution: include checksums with each record. On recovery, verify checksums and discard any corrupted records.
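Here's a minimal sketch of that idea using CRC-32 (Bitcask stores a CRC with each record; the exact layout below is an assumption for illustration):

import struct
import zlib

def append_record(f, payload: bytes):
    """Write crc(4) + length(4) + payload."""
    f.write(struct.pack('>II', zlib.crc32(payload), len(payload)))
    f.write(payload)

def read_record(f):
    """Return the payload, or None if the record is truncated or corrupted."""
    header = f.read(8)
    if len(header) < 8:
        return None  # clean end of file, or a torn header from a crash
    crc, length = struct.unpack('>II', header)
    payload = f.read(length)
    if len(payload) < length or zlib.crc32(payload) != crc:
        return None  # torn or corrupted record -- discard it and stop recovery here
    return payload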
Concurrency Control
Writes must be sequential (you can't have two threads appending at the same position), so most implementations use a single writer thread. Reads, however, can happen concurrently from multiple threads, since the segment files are immutable once written.
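One way to arrange this, as a sketch (the queue-based design is an assumption for illustration, not how Bitcask is implemented):

import queue
import threading

class SingleWriterWrapper:
    """Funnel all writes through one background thread; reads go straight to the store."""
    def __init__(self, db):
        self.db = db                   # e.g. a HashIndexedDatabase from earlier
        self.pending = queue.Queue()
        threading.Thread(target=self._writer_loop, daemon=True).start()

    def _writer_loop(self):
        while True:
            key, value = self.pending.get()
            self.db.set(key, value)    # only this thread ever appends to the log

    def set(self, key, value):
        self.pending.put((key, value))  # any thread may enqueue a write

    def get(self, key):
        return self.db.get(key)         # reads never block behind other readers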
Why Append-Only? Why Not Update In-Place?
You might wonder: why all this complexity? Why not just find the old value and overwrite it? Append-only seems wasteful.
But append-only has profound advantages:
Sequential writes are fast. On spinning hard drives, the head stays near the end of the file and keeps writing; there's no seeking around. Even on SSDs, sequential writes are preferred because of how flash storage handles writes internally.
Crash recovery is simpler. If you crash while overwriting, you might be left with half old data and half new data: a corrupted mess. With append-only, a crash just means an incomplete record at the end, which you can detect and discard.
Concurrency is easier. Immutable segments can be read by anyone without locks. Only the writer needs to coordinate.
No fragmentation. Files don't get holes in them from different-sized overwrites. Each segment is a contiguous sequence of records.
The Limitations of Hash Indexes
Hash indexes are powerful but not perfect. There are two major limitations:
1. All Keys Must Fit in Memory
The hash map lives entirely in RAM. If you have billions of keys, you need billions of entries in your hash map. At some point, you run out of memory.
Could you put the hash map on disk? In theory, yes. In practice, it's hard to make it perform well. Hash maps rely on random access (jump to bucket X, check for a collision, maybe jump to bucket Y). That's fast in RAM but painfully slow on disk, where random I/O is far more expensive than sequential I/O, and on a spinning disk every random access means moving the disk head.
2. Range Queries Are Terrible
Want to find all keys between "user:00001" and "user:99999"? With a hash index, you're out of luck. Hash maps don't preserve any ordering, so you'd have to look up each of the 99,999 possible keys individually.
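A tiny sketch shows why: with only a hash index, a range scan degenerates into one point lookup per candidate key, whereas a sorted index could seek once and read sequentially:

def range_query(db, start_id, end_id):
    """Brute-force range scan over a hash-indexed store: one lookup per candidate key."""
    results = {}
    for i in range(start_id, end_id + 1):
        key = f"user:{i:05d}"
        value = db.get(key)  # O(1) each, but there are ~100,000 candidates
        if value is not None:
            results[key] = value
    return results

# range_query(db, 1, 99999) works, but it must probe every possible key in the range.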
Complete Implementation Example
Let's put it all together with a more complete implementation that demonstrates the core concepts:
import os
import struct
import hashlib
from threading import Lock
from typing import Optional, Dict

class BitcaskStyleDB:
    """
    A simplified Bitcask-style storage engine demonstrating:
    - Append-only log with binary format
    - In-memory hash index
    - Segment files (compaction itself is omitted for brevity)
    - Checksums for data integrity
    - Tombstones for deletion
    """
    TOMBSTONE = b'\x00DELETED\x00'

    def __init__(self, directory: str, max_segment_size: int = 10 * 1024 * 1024):
        self.directory = directory
        self.max_segment_size = max_segment_size
        self.write_lock = Lock()
        os.makedirs(directory, exist_ok=True)
        # In-memory index: key -> (segment_file, offset)
        self.index: Dict[str, tuple] = {}
        # Current segment being written to
        self.current_segment = None
        self.current_segment_file = None
        self.segment_counter = 0
        self._recover()

    def _recover(self):
        """Rebuild index from segment files on startup."""
        segment_files = sorted(
            [f for f in os.listdir(self.directory) if f.startswith('segment_')],
            key=lambda f: int(f.split('_')[1].split('.')[0])  # numeric order, not lexicographic
        )
        for filename in segment_files:
            self._load_segment_into_index(os.path.join(self.directory, filename))
            self.segment_counter = max(
                self.segment_counter,
                int(filename.split('_')[1].split('.')[0]) + 1
            )
        self._open_new_segment()

    def _load_segment_into_index(self, filepath: str):
        """Scan a segment file and add its keys to the index."""
        with open(filepath, 'rb') as f:
            while True:
                offset = f.tell()
                record = self._read_record(f)
                if record is None:
                    break
                key, value = record
                if value == self.TOMBSTONE:
                    self.index.pop(key, None)  # Deletion
                else:
                    self.index[key] = (filepath, offset)

    def _open_new_segment(self):
        """Start writing to a new segment file."""
        if self.current_segment_file:
            self.current_segment_file.close()
        self.current_segment = os.path.join(
            self.directory, f'segment_{self.segment_counter}.db'
        )
        self.current_segment_file = open(self.current_segment, 'ab')
        self.segment_counter += 1

    def _write_record(self, key: str, value: bytes) -> int:
        """Write a record and return its offset."""
        key_bytes = key.encode('utf-8')
        # Calculate checksum
        checksum = hashlib.md5(key_bytes + value).digest()[:4]
        # Record format: checksum(4) + key_len(4) + key + value_len(4) + value
        offset = self.current_segment_file.tell()
        self.current_segment_file.write(checksum)
        self.current_segment_file.write(struct.pack('>I', len(key_bytes)))
        self.current_segment_file.write(key_bytes)
        self.current_segment_file.write(struct.pack('>I', len(value)))
        self.current_segment_file.write(value)
        self.current_segment_file.flush()
        return offset

    def _read_record(self, f) -> Optional[tuple]:
        """Read a record from file, verify checksum, return (key, value) or None."""
        checksum = f.read(4)
        if not checksum:
            return None
        key_len_bytes = f.read(4)
        if not key_len_bytes:
            return None
        key_len = struct.unpack('>I', key_len_bytes)[0]
        key_bytes = f.read(key_len)
        value_len = struct.unpack('>I', f.read(4))[0]
        value = f.read(value_len)
        # Verify checksum
        expected_checksum = hashlib.md5(key_bytes + value).digest()[:4]
        if checksum != expected_checksum:
            raise ValueError("Checksum mismatch - data corruption detected!")
        return key_bytes.decode('utf-8'), value

    def set(self, key: str, value: str):
        """Store a key-value pair."""
        with self.write_lock:
            # Check if we need a new segment
            if os.path.getsize(self.current_segment) > self.max_segment_size:
                self._open_new_segment()
            offset = self._write_record(key, value.encode('utf-8'))
            self.index[key] = (self.current_segment, offset)

    def get(self, key: str) -> Optional[str]:
        """Retrieve a value by key. Returns None if not found."""
        if key not in self.index:
            return None
        filepath, offset = self.index[key]
        with open(filepath, 'rb') as f:
            f.seek(offset)
            _, value = self._read_record(f)
            return value.decode('utf-8')

    def delete(self, key: str):
        """Delete a key by writing a tombstone."""
        with self.write_lock:
            if key in self.index:
                self._write_record(key, self.TOMBSTONE)
                del self.index[key]

    def close(self):
        """Close the database cleanly."""
        if self.current_segment_file:
            self.current_segment_file.close()

# Usage demonstration
if __name__ == "__main__":
    db = BitcaskStyleDB("/tmp/mydb")

    # Write some data
    db.set("user:1", '{"name": "Alice", "email": "alice@example.com"}')
    db.set("user:2", '{"name": "Bob", "email": "bob@example.com"}')
    db.set("user:1", '{"name": "Alice Smith", "email": "alice@example.com"}')

    # Read it back
    print(db.get("user:1"))  # Latest value for user:1
    print(db.get("user:2"))

    # Delete
    db.delete("user:2")
    print(db.get("user:2"))  # None

    db.close()

Key Takeaways
Hash indexes are a powerful and practical indexing strategy with clear trade-offs:
| Aspect | Hash Index Characteristic |
|--------|---------------------------|
| Write Speed | Excellent (append-only) |
| Read Speed | Excellent (O(1) hash lookup + 1 disk seek) |
| Memory Requirement | All keys must fit in RAM |
| Range Queries | Not supported |
| Best For | High write throughput, point lookups, moderate key count |
| Real Example | Bitcask (Riak's storage engine) |
In the next section, we'll explore indexing structures that overcome these limitations: structures that can handle more keys than fit in memory and support efficient range queries.