add_batch(): 3.4× Faster Bulk Ingestion in Feather DB
Feather DB Phase 8 ships add_batch() — parallel batch ingest with the GIL released. At scale, 3.4× faster than sequential add() calls. Here's the API, the internals, and when to use it.
The sequential ingestion bottleneck
Ingesting large vector corpora into Feather DB was previously sequential: a Python loop calling db.add(id, vec) for each document. Each call crosses the Python/C++ boundary, acquires the GIL for the pybind11 trampoline, inserts into the HNSW graph, and releases the GIL. At 100k+ documents, this loop becomes the bottleneck.
add_batch(), shipped in Phase 8 of Feather's optimization roadmap, builds the HNSW graph in parallel with the GIL released. The result: ~3.4× faster bulk insert in Python code.
The API
import feather_db as fdb
import numpy as np
db = fdb.DB.open("corpus.feather", dim=768)
# Prepare your data
ids = list(range(10_000))
vecs = np.random.randn(10_000, 768).astype(np.float32)
# Optional: metadata per vector
metas = [fdb.Metadata(importance=0.8) for _ in range(10_000)]
# Single parallel call — GIL released during graph construction
db.add_batch(ids, vecs, metas=metas)
db.save()
add_batch() accepts:
- ids: list of int or 1-D int array
- vecs: 2-D float32 numpy array, shape (N, dim)
- metas: optional list of Metadata objects, length N
The call is equivalent to N sequential add() calls but uses a thread pool internally, building HNSW candidate lists in parallel before merging them into the main index.
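The two-phase shape described above (compute candidate lists in parallel, then merge from one thread) can be sketched in pure Python. This only illustrates the pattern and is not Feather's C++ implementation: it ranks neighbours by brute-force distance rather than HNSW layer search, vectors in the same batch are not considered as candidates for each other, and the names `nearest_candidates`, `batch_insert`, `m`, and `workers` are hypothetical. Pure-Python threads also still contend for the GIL, which is exactly what the native code avoids by releasing it.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def nearest_candidates(vec, index, m):
    """Return the ids of the m closest vectors already in the index."""
    ranked = sorted(index, key=lambda other_id: math.dist(vec, index[other_id]))
    return ranked[:m]


def batch_insert(index, graph, ids, vecs, m=2, workers=3):
    """Phase 1: build candidate lists in parallel against a frozen view.
    Phase 2: merge nodes and edges into the graph from a single thread."""
    snapshot = dict(index)  # workers only read this copy
    with ThreadPoolExecutor(max_workers=workers) as pool:
        candidates = list(
            pool.map(lambda v: nearest_candidates(v, snapshot, m), vecs)
        )

    for new_id, vec, neighbours in zip(ids, vecs, candidates):
        index[new_id] = vec
        graph.setdefault(new_id, set()).update(neighbours)
        for n in neighbours:  # keep edges bidirectional
            graph.setdefault(n, set()).add(new_id)
    return graph


index = {0: (0.0, 0.0), 1: (10.0, 10.0)}
graph = {0: set(), 1: set()}
batch_insert(index, graph, [2, 3], [(1.0, 0.0), (9.0, 10.0)], m=1)
print(graph)  # {0: {2}, 1: {3}, 2: {0}, 3: {1}}
```

Keeping the merge on one thread means the graph itself needs no locking; only the read-only candidate search is fanned out.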
Benchmark numbers
On a 4-core machine, inserting 50k × 768-dim vectors:
| Method | Time | Speedup |
|---|---|---|
| Sequential add() loop | ~34s | 1× |
| add_batch() | ~10s | 3.4× |
The speedup scales with core count up to the HNSW construction thread pool size (default: CPU cores - 1). On an 8-core machine, expect ~5–6× over the sequential baseline.
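To reproduce numbers like these on your own hardware, a few small helpers are enough. The sketch below assumes nothing about Feather itself: `default_pool_size`, `speedup`, and `time_call` are hypothetical names, and the pool-size rule simply encodes the "CPU cores - 1" default quoted above.

```python
import os
import time


def default_pool_size(cores=None):
    """Thread count under the 'CPU cores - 1' rule, never below 1."""
    cores = cores if cores is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)


def speedup(baseline_seconds, batch_seconds):
    """Ratio of sequential time to batch time, rounded to one decimal."""
    return round(baseline_seconds / batch_seconds, 1)


def time_call(fn, *args, **kwargs):
    """Run fn once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


print(default_pool_size(4))  # 3
print(speedup(34, 10))       # 3.4
```

For a fair comparison, time the sequential loop and the batch call against separate, freshly opened databases, for example `time_call(db.add_batch, ids, vecs)` for the batch path.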
Combined with parallel load
Phase 8 also ships parallel HNSW load via FEATHER_LOAD_THREADS. The full fast-startup pattern:
import os
import feather_db as fdb
import numpy as np
os.environ["FEATHER_LOAD_THREADS"] = "8" # parallel cold-start load
db = fdb.DB.open("corpus.feather", dim=768)
# Ingest 100k vectors in one parallel call
ids = list(range(100_000))
vecs = np.load("corpus.npy").astype(np.float32) # shape: (100_000, 768); add_batch() expects float32
db.add_batch(ids, vecs)
db.save()
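If you would rather not hard-code the thread count, a small helper can derive it from the machine. This is a sketch under assumptions: `configure_load_threads` is a hypothetical name, and the post does not say exactly when Feather reads FEATHER_LOAD_THREADS, so the safe choice is to call it before opening the database, as the snippet above does.

```python
import os


def configure_load_threads(threads=None):
    """Set FEATHER_LOAD_THREADS to `threads`, or to the CPU count if omitted."""
    if threads is None:
        threads = os.cpu_count() or 1
    if threads < 1:
        raise ValueError("threads must be >= 1")
    # Environment variables are strings, so convert before storing.
    os.environ["FEATHER_LOAD_THREADS"] = str(threads)
    return threads


configure_load_threads(8)
print(os.environ["FEATHER_LOAD_THREADS"])  # 8
```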
When to use add_batch vs add
Use add_batch() whenever you're ingesting more than ~1k vectors at once:
- Corpus ingestion pipelines (PDF chunking, web crawls, document imports)
- Cold-start memory seeding (loading a user's historical data at session start)
- Batch import from CSV / Parquet / database exports
- Benchmark harnesses (LongMemEval, SIFT1M ingest phase)
Use sequential add() for real-time, single-item ingestion where latency per item matters more than throughput — adding a new memory immediately after a conversation turn, for example.
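The rule of thumb above fits in a small dispatcher. This is a minimal sketch, assuming only the two call shapes shown in this post, `db.add(id, vec)` and `db.add_batch(ids, vecs)`; the `ingest` helper and its `batch_threshold` parameter are hypothetical, with the default taken from the ~1k guideline.

```python
def ingest(db, ids, vecs, batch_threshold=1000):
    """Route small inputs to add() and large ones to add_batch().

    Returns the name of the path taken so callers can log it.
    """
    if len(ids) != len(vecs):
        raise ValueError("ids and vecs must have the same length")

    if len(ids) > batch_threshold:
        db.add_batch(ids, vecs)  # one parallel call for bulk loads
        return "add_batch"

    for vec_id, vec in zip(ids, vecs):
        db.add(vec_id, vec)  # per-item path for small, latency-sensitive writes
    return "add"
```

Treat the threshold as a starting point; the crossover depends on vector dimension and core count, so measure it on your own workload.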
Metadata with add_batch
import feather_db as fdb
import numpy as np
db = fdb.DB.open("corpus.feather", dim=768)
# Assign importance from an external score (e.g. engagement, spend)
scores = np.load("scores.npy") # float array, same length as vecs
metas = []
for score in scores:
    # Cap importance at 1.0 and tag each vector with its origin
    m = fdb.Metadata(importance=float(min(1.0, score)))
    m.set_attribute("source", "batch_import")
    metas.append(m)
ids = list(range(len(scores)))
vecs = np.load("vecs.npy").astype(np.float32) # add_batch() expects float32
db.add_batch(ids, vecs, metas=metas)
Important: use meta.set_attribute(key, value) — not meta.attributes[key] = value. The latter silently does nothing due to a pybind11 copy semantics issue.
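The failure mode is easy to reproduce without Feather. The class below is a pure-Python stand-in, not the real `Metadata` binding: it assumes the `attributes` getter returns a fresh copy on every access, which is how a by-value container conversion behaves, so writes through the getter never reach the stored data.

```python
class Meta:
    """Stand-in whose `attributes` property returns a copy each time."""

    def __init__(self):
        self._attrs = {}

    @property
    def attributes(self):
        return dict(self._attrs)  # a new dict on every access

    def set_attribute(self, key, value):
        self._attrs[key] = value  # writes to the real storage


m = Meta()
m.attributes["source"] = "batch_import"  # mutates a temporary copy
print(m.attributes)                      # {}
m.set_attribute("source", "batch_import")
print(m.attributes)                      # {'source': 'batch_import'}
```

No exception is raised in the first case, which is why the bug is easy to miss; reading the attribute back after writing is a cheap check.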
Install
pip install feather-db
add_batch() is available from v0.13 onwards.
GitHub: github.com/feather-store/feather