Performance · 5 min read · June 16, 2026

Parallel HNSW Load in Feather DB: 4.7× Faster Cold Start

Feather DB v0.15 introduced parallel HNSW load via FEATHER_LOAD_THREADS. At 40K vectors × 128-dim, cold start drops from 7.6s to 1.7s with 8 threads (4.5×) and to 1.6s with 16 threads (4.7×). The improvement matters most for serverless functions, frequent pod restarts, and large embedded indexes.

Feather DB
Engineering

Why HNSW load time matters

Feather DB stores the HNSW graph, vectors, and metadata in a single .feather file on disk. When you call fdb.DB.open(), the entire index is loaded into memory: vectors are deserialized, the HNSW graph structure is reconstructed, and the BM25 index is rebuilt. Until that's done, no search is possible.

For most persistent server processes, this happens once at startup and the cost is amortized over millions of queries. In three scenarios, however, cold start latency is a first-class concern:

  • Serverless functions (AWS Lambda, Google Cloud Functions, Vercel Edge): a cold start occurs whenever an invocation lands on an instance that has not been kept warm, typically after a few minutes of inactivity. A cold start of 7 seconds or more before the first query makes agent memory impractical in these environments.
  • Kubernetes pod restarts: rolling deployments, OOM kills, and node drains all cause cold starts. At high deployment frequency, a 7-second startup adds material overhead to your deployment pipeline.
  • Development iteration: if you restart your local dev server after every code change, a slow index load means slower iteration cycles.

What the parallel load does

HNSW graph reconstruction from disk has two phases:

  1. IO phase: deserialize the raw bytes from the .feather file into memory — vector data, edge lists, level assignments.
  2. Graph build phase: for each node, reconstruct the neighbor lists and validate the HNSW invariants (bidirectional edges, level assignments).

The serial implementation processes nodes sequentially in phase 2. The parallel implementation splits the node list across threads, and each thread reconstructs its shard of the graph concurrently. Node neighbor lists are independent: each node's links reference other nodes, but reconstructing node A's neighbor list doesn't require node B's list to be finished first. The graph build phase therefore parallelizes well. In the measurements below, scaling is close to linear up to 4 threads and flattens beyond that.
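The sharding idea can be sketched in a few lines of Python. This is an illustration of the partitioning only, not Feather DB's implementation: the flat edge array with per-node offsets is an assumed layout, and the names `shard_ranges`, `rebuild_shard`, and `rebuild_graph` are hypothetical. CPython threads also share the GIL, so this sketch shows how the work is divided rather than reproducing the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(n_nodes, n_threads):
    """Split range(n_nodes) into up to n_threads contiguous, near-equal ranges."""
    base, extra = divmod(n_nodes, n_threads)
    ranges, start = [], 0
    for i in range(n_threads):
        size = base + (1 if i < extra else 0)
        if size:
            ranges.append((start, start + size))
        start += size
    return ranges


def rebuild_shard(flat_edges, offsets, start, stop):
    """Rebuild neighbor lists for nodes [start, stop) from the flat edge array."""
    return [list(flat_edges[offsets[i]:offsets[i + 1]]) for i in range(start, stop)]


def rebuild_graph(flat_edges, offsets, n_threads):
    """Rebuild every node's neighbor list, one shard of nodes per worker."""
    n_nodes = len(offsets) - 1
    ranges = shard_ranges(n_nodes, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        shards = pool.map(lambda r: rebuild_shard(flat_edges, offsets, *r), ranges)
        # map() yields results in shard order, so concatenation restores node order.
        return [neighbors for shard in shards for neighbors in shard]


# Toy graph with 5 nodes; offsets[i]:offsets[i + 1] delimits node i's edges.
flat_edges = [1, 2, 0, 2, 0, 1, 3, 2, 4, 3]
offsets = [0, 2, 4, 7, 9, 10]
graph = rebuild_graph(flat_edges, offsets, n_threads=2)
print(graph)
```

Each worker only reads the shared arrays and writes its own output list, which is why no locking is needed in this sketch.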

Phase 1 (IO) is disk-bound and doesn't benefit from parallelism unless using NVMe with high queue depth. Phase 2 (graph build) is CPU-bound and parallelizes well. At 40K vectors × 128-dim on a modern 8-core CPU:

FEATHER_LOAD_THREADS   Load time (40K × 128-dim)   Speedup
1 (serial)             7.6s                        1.0×
2                      4.1s                        1.9×
4                      2.3s                        3.3×
8                      1.7s                        4.5×
16                     1.6s                        4.7×

The speedup saturates around 8–16 threads because the IO phase and cache contention become the bottleneck. On hosts with 8 or more cores, FEATHER_LOAD_THREADS=8 is a sensible default: doubling to 16 threads saves only another 0.1s.
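To make the saturation concrete, the speedup and per-thread efficiency can be recomputed from the load times in the table. The sketch below uses only those published figures; `scaling_report` is a hypothetical helper name.

```python
# Load times in seconds from the benchmark table (40K vectors × 128-dim).
load_times = {1: 7.6, 2: 4.1, 4: 2.3, 8: 1.7, 16: 1.6}


def scaling_report(times):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    serial = times[1]
    report = {}
    for threads, seconds in sorted(times.items()):
        speedup = serial / seconds
        report[threads] = (speedup, speedup / threads)
    return report


report = scaling_report(load_times)
for threads, (speedup, efficiency) in report.items():
    print(f"{threads:>2} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

Efficiency (speedup divided by thread count) falls from roughly 93% at 2 threads to about 30% at 16, which is the point where additional threads stop paying for themselves.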

Enabling parallel load

import os
import feather_db as fdb

# Must be set before the first DB.open() call that loads the index
os.environ["FEATHER_LOAD_THREADS"] = "8"

# Or in your shell / Dockerfile / .env
# export FEATHER_LOAD_THREADS=8

db = fdb.DB.open("memory.feather", dim=768)
# Benchmark: ~1.7s instead of 7.6s (measured at 40K vectors × 128-dim)

The environment variable must be set before the first DB.open() call that loads the index. Setting it afterward has no effect on an already-loaded index. In a web server, set it in the container startup script, or assign it to os.environ before any code path that calls DB.open().
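The ordering requirement is what you would expect if the thread count is read once, at the moment the index is loaded. The following is a minimal sketch of that pattern, not Feather DB's actual code: `read_load_threads`, the default of 1, and the fallback on invalid values are all assumptions made for illustration.

```python
import os


def read_load_threads(env=None, default=1):
    """Return a positive thread count from FEATHER_LOAD_THREADS, else the default."""
    env = os.environ if env is None else env
    raw = env.get("FEATHER_LOAD_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        return default  # unset or not a number
    return value if value > 0 else default


# The value is captured at load time; changing the variable later does not
# affect an index that has already been loaded with the earlier value.
env = {"FEATHER_LOAD_THREADS": "8"}
threads_at_load = read_load_threads(env)
env["FEATHER_LOAD_THREADS"] = "2"  # too late for the already-loaded index
print(threads_at_load)
```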

Combining with int8 RAM quantization

Parallel load and int8 RAM quantization (also introduced in v0.15) are complementary optimizations:

  • Parallel load reduces cold start time by parallelizing HNSW graph reconstruction
  • int8 quantization reduces RAM usage by 1.7× by storing vectors as 8-bit integers with per-vector scale factors

Together, they shorten startup and shrink the memory footprint, which makes the combination a good fit for serverless and memory-constrained deployments:

import os
import feather_db as fdb

# Enable both optimizations
os.environ["FEATHER_LOAD_THREADS"] = "8"

# Open with int8 RAM quantization
db = fdb.DB.open("memory.feather", dim=768, quantize_ram=True)

# Result at 40K vectors × 128-dim:
# - Load time: ~1.7s (vs 7.6s serial float32)
# - RAM: ~35MB (vs ~60MB float32 without quantization)
# - Recall@10: 96.9% (vs 97.2% full precision — negligible delta)
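For readers unfamiliar with the technique, here is a minimal sketch of int8 quantization with a per-vector scale factor, using only the standard library. It illustrates the general idea named above and is not Feather DB's implementation; the function names and the symmetric max-abs scaling are assumptions.

```python
from array import array


def quantize_int8(vec):
    """Map floats to int8 codes plus one scale factor for the whole vector."""
    peak = max((abs(x) for x in vec), default=0.0)
    scale = peak / 127.0 if peak > 0 else 1.0
    codes = array("b", (round(x / scale) for x in vec))  # 1 byte per value
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in codes]


vec = [0.12, -0.5, 0.33, 0.0, 0.49]
codes, scale = quantize_int8(vec)
approx = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, approx))
print(list(codes), scale, max_error)
```

The rounding error per component is bounded by half the scale factor. Raw vector storage drops from 4 bytes to 1 byte per value in this scheme; the overall RAM reduction quoted above is smaller, presumably because the graph and other structures are not quantized.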

Sizing FEATHER_LOAD_THREADS for your deployment

The right thread count depends on your CPU core count and whether other processes compete for CPU during startup:

Deployment environment                 Recommended FEATHER_LOAD_THREADS   Notes
Lambda / Cloud Functions (1-2 vCPU)    2                                  More threads may contend with the runtime itself
Container with 4 vCPU                  4                                  Matches available cores
Container with 8+ vCPU                 8                                  Speedup saturates here
Development laptop (M1/M2 Mac)         8                                  High core count makes this worthwhile locally
Bare metal with 32+ cores              8–16                               Beyond 16, IO becomes the bottleneck

# In a Dockerfile
ENV FEATHER_LOAD_THREADS=8
ENV FEATHER_QUANTIZE_RAM=1

# In a Lambda environment variable config
FEATHER_LOAD_THREADS=2

# In a docker-compose.yml
services:
  feather-api:
    environment:
      FEATHER_LOAD_THREADS: "8"
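If you would rather derive the value at startup than hard-code it per environment, a small helper can apply the sizing guidance above. This is a sketch under assumptions: `pick_load_threads` is a hypothetical name, and the cap of 8 simply mirrors the point where the benchmark saturates.

```python
import os


def pick_load_threads(cap=8, available=None):
    """Use one thread per available core, capped where the speedup saturates."""
    cores = available if available is not None else (os.cpu_count() or 1)
    return max(1, min(cores, cap))


# setdefault keeps an explicit value from the shell, Dockerfile, or .env file.
os.environ.setdefault("FEATHER_LOAD_THREADS", str(pick_load_threads()))
print(os.environ["FEATHER_LOAD_THREADS"])
```

Note that os.cpu_count() reports the host's logical CPUs and does not account for container CPU quotas, so in a CPU-limited container it is safer to pass `available` explicitly or set the variable by hand. Run this before the first DB.open() call.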

Monitoring load time

import os
import time
import feather_db as fdb

os.environ["FEATHER_LOAD_THREADS"] = "8"

start = time.perf_counter()
db = fdb.DB.open("memory.feather", dim=768)
elapsed = time.perf_counter() - start

print(f"Index loaded in {elapsed:.2f}s")
print(f"Vectors: {db.count()}")
print(f"RAM: ~{db.count() * 768 * 4 / 1e6:.0f}MB (float32 vectors only, dim=768)")
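The last line of the snippet estimates raw vector storage only. A small helper generalizes that arithmetic to any dimension and to the int8 case. It is a rough sketch: `estimate_vector_ram_mb` is a hypothetical name, and storing each per-vector scale factor as a 4-byte float is an assumption. Graph links, metadata, and the BM25 index are not included, so measured process memory will be higher.

```python
def estimate_vector_ram_mb(count, dim, quantized=False):
    """Estimate raw vector storage in MB (excludes graph, metadata, BM25)."""
    bytes_per_value = 1 if quantized else 4      # int8 vs float32
    scale_bytes = 4 * count if quantized else 0  # assumed float32 scale per vector
    return (count * dim * bytes_per_value + scale_bytes) / 1e6


print(f"float32: {estimate_vector_ram_mb(40_000, 128):.2f} MB")
print(f"int8:    {estimate_vector_ram_mb(40_000, 128, quantized=True):.2f} MB")
```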

For cold-start-sensitive deployments, parallel HNSW load cuts index load time by up to 4.7× in the benchmark above. Combined with int8 quantization, it makes Feather DB practical in serverless and edge environments where cold start time and memory budget are the primary constraints, at a cost of 0.3 points of Recall@10 (96.9% vs 97.2%).

Install: pip install feather-db · GitHub: github.com/feather-store/feather