add_batch(): 3.4× Faster Bulk Ingestion in Feather DB

The sequential ingestion bottleneck

Ingesting large vector corpora into Feather DB was previously sequential: a Python loop calling db.add(id, vec) once per document. Each call crosses the Python/C++ boundary through pybind11 and inserts one vector into the HNSW graph on the calling thread, so the boundary-crossing and GIL overhead is paid once per document and no two inserts overlap. At 100k+ documents, this loop becomes the bottleneck.

add_batch(), shipped in Phase 8 of Feather's optimization roadmap, builds the HNSW graph in parallel with the GIL released. The result: ~3.4× faster bulk inserts from Python, measured on a 4-core machine.

The API

import feather_db as fdb
import numpy as np

db = fdb.DB.open("corpus.feather", dim=768)

# Prepare your data
ids  = list(range(10_000))
vecs = np.random.randn(10_000, 768).astype(np.float32)

# Optional: metadata per vector
metas = [fdb.Metadata(importance=0.8) for _ in range(10_000)]

# Single parallel call — GIL released during graph construction
db.add_batch(ids, vecs, metas=metas)
db.save()

add_batch() accepts:

ids: list of int or 1-D int array
vecs: 2-D float32 numpy array, shape (N, dim)
metas: optional list of Metadata objects, length N

The call is functionally equivalent to N sequential add() calls, but it uses an internal thread pool: HNSW candidate lists are built in parallel and then merged into the main index.
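Because add_batch() expects specific shapes and dtypes, it can help to check the inputs before handing them over. The sketch below is one possible pre-flight check based only on the input contract listed above. prepare_batch is a hypothetical helper, not part of feather_db, and how the library itself reacts to malformed input is not covered here.

```python
import numpy as np


def prepare_batch(ids, vecs, dim, metas=None):
    """Validate inputs against the add_batch() contract described above.

    Hypothetical helper for illustration; not part of feather_db.
    """
    ids = np.asarray(ids)
    if ids.ndim != 1 or not np.issubdtype(ids.dtype, np.integer):
        raise ValueError("ids must be a 1-D sequence of integers")

    # Convert to C-contiguous float32; copies only when a conversion is needed.
    vecs = np.ascontiguousarray(vecs, dtype=np.float32)
    if vecs.ndim != 2 or vecs.shape[1] != dim:
        raise ValueError(f"vecs must have shape (N, {dim}), got {vecs.shape}")
    if vecs.shape[0] != ids.shape[0]:
        raise ValueError("ids and vecs must have the same length")

    if metas is not None and len(metas) != ids.shape[0]:
        raise ValueError("metas must contain one entry per vector")

    return ids, vecs, metas
```

A call would then look like db.add_batch(*prepare_batch(ids, vecs, dim=768)) when no metadata is passed.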

Benchmark numbers

On a 4-core machine, inserting 50k × 768-dim vectors:

Method	Time	Speedup
Sequential `add()` loop	~34s	1×
`add_batch()`	~10s	3.4×

The speedup scales with core count up to the HNSW construction thread pool size (default: CPU cores - 1). On an 8-core machine, expect ~5–6× over the sequential baseline.
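To run a comparison like the one in the table on your own hardware, a small timing harness is enough. The sketch below relies only on db.add(id, vec) and db.add_batch(ids, vecs) as used in this post. The helper names (time_call, sequential_ingest, batch_ingest, speedup) are hypothetical and not part of feather_db.

```python
import time


def time_call(fn, *args, **kwargs):
    """Return the wall-clock seconds taken by fn(*args, **kwargs)."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start


def sequential_ingest(db, ids, vecs):
    # Baseline: one Python-to-C++ call per vector.
    for vec_id, vec in zip(ids, vecs):
        db.add(vec_id, vec)


def batch_ingest(db, ids, vecs):
    # Single call; the parallel work happens inside add_batch().
    db.add_batch(ids, vecs)


def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline time to optimized time."""
    return baseline_seconds / optimized_seconds
```

Each method should be timed against its own freshly created database so the second run does not insert into an already populated index. With the figures from the table, speedup(34, 10) gives 3.4.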

Combined with parallel load

Phase 8 also ships parallel HNSW load via FEATHER_LOAD_THREADS. The full fast-startup pattern:

import os

import numpy as np
import feather_db as fdb

os.environ["FEATHER_LOAD_THREADS"] = "8"   # parallel cold-start load

db = fdb.DB.open("corpus.feather", dim=768)

# Ingest 100k vectors in one parallel call
vecs = np.load("corpus.npy").astype(np.float32, copy=False)   # shape: (100_000, 768)
ids  = list(range(len(vecs)))   # one id per row, derived from the data
db.add_batch(ids, vecs)
db.save()
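Rather than hard-coding "8", the thread count can be derived from the machine. This is a minimal sketch, assuming FEATHER_LOAD_THREADS is read when the database is opened (as the ordering in the example above suggests) and that it accepts any positive integer. set_load_threads is a hypothetical helper, not part of feather_db.

```python
import os


def set_load_threads(n=None):
    """Set FEATHER_LOAD_THREADS, defaulting to the number of logical CPUs.

    Hypothetical helper; call it before opening the database.
    """
    if n is None:
        # os.cpu_count() can return None on some platforms.
        n = os.cpu_count() or 1
    if n < 1:
        raise ValueError("thread count must be at least 1")
    os.environ["FEATHER_LOAD_THREADS"] = str(n)
    return n
```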

When to use add_batch vs add

Use add_batch() whenever you're ingesting more than ~1k vectors at once:

Corpus ingestion pipelines (PDF chunking, web crawls, document imports)
Cold-start memory seeding (loading a user's historical data at session start)
Batch import from CSV / Parquet / database exports
Benchmark harnesses (LongMemEval, SIFT1M ingest phase)

Use sequential add() for real-time, single-item ingestion where latency per item matters more than throughput — adding a new memory immediately after a conversation turn, for example.
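The rule of thumb above can be wrapped in a small dispatcher. The sketch below uses only db.add(id, vec) and db.add_batch(ids, vecs) as shown in this post. The ingest function and the BATCH_THRESHOLD constant are hypothetical, and the threshold is worth tuning against your own measurements.

```python
BATCH_THRESHOLD = 1_000  # the "~1k vectors" rule of thumb from the text


def ingest(db, ids, vecs, batch_threshold=BATCH_THRESHOLD):
    """Route a write to add_batch() or a sequential add() loop by size.

    Hypothetical helper; returns the name of the path taken.
    """
    if len(ids) != len(vecs):
        raise ValueError("ids and vecs must have the same length")

    if len(ids) > batch_threshold:
        # Large write: one parallel call, throughput matters most.
        db.add_batch(ids, vecs)
        return "add_batch"

    # Small or real-time write: per-item latency matters most.
    for vec_id, vec in zip(ids, vecs):
        db.add(vec_id, vec)
    return "add"
```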

Metadata with add_batch

import feather_db as fdb
import numpy as np

db = fdb.DB.open("corpus.feather", dim=768)

# Assign importance from an external score (e.g. engagement, spend)
scores = np.load("scores.npy")   # float array, same length as vecs

metas = []
for score in scores:
    m = fdb.Metadata(importance=float(min(1.0, score)))
    m.set_attribute("source", "batch_import")
    metas.append(m)

ids  = list(range(len(scores)))
vecs = np.load("vecs.npy").astype(np.float32, copy=False)   # add_batch expects float32
db.add_batch(ids, vecs, metas=metas)
db.save()

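Important: use meta.set_attribute(key, value), not meta.attributes[key] = value. The latter silently does nothing because of pybind11 copy semantics: reading meta.attributes hands back a copy of the underlying map, so the assignment modifies a temporary that is immediately discarded.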
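The pitfall is easy to reproduce without Feather. The pure-Python class below is a stand-in that mimics a property returning a copy, which is how pybind11 typically exposes a C++ map converted to a Python dict. CopyingMeta is hypothetical and is not Feather's Metadata class.

```python
class CopyingMeta:
    """Stand-in for a binding whose `attributes` getter returns a copy."""

    def __init__(self):
        self._attrs = {}

    @property
    def attributes(self):
        # A fresh dict on every access, so callers never see the real storage.
        return dict(self._attrs)

    def set_attribute(self, key, value):
        # Writes go to the real storage.
        self._attrs[key] = value


m = CopyingMeta()

m.attributes["source"] = "batch_import"   # mutates a temporary copy
lost = "source" not in m.attributes       # the write did not stick

m.set_attribute("source", "batch_import")
kept = m.attributes["source"]             # now visible on the next read
```

No exception is raised on the first assignment, which is why the mistake is easy to miss.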

Install

pip install feather-db

add_batch() is available from v0.13 onwards.

GitHub: github.com/feather-store/feather