The 768-Dimension Bet: Storing Text, Image, and Video in One Unified Vector Space
Most stacks store text, image, and video vectors in separate indexes because their embedding spaces are incompatible. Feather DB's 768-dimensional unified space — backed by Gemini Embedding 2 — collapses three indexes into one. Here is the architectural argument for why that is worth doing.
Architecture Deep Dive · Unified Embedding · May 2026
The Multi-Bucket Default (and Why It Hurts)
Most stacks that retrieve over multiple modalities store them in separate indexes. Text goes to one vector DB, images to another, video transcripts to a third. The reason is correct: most text encoders, image encoders, and video encoders produce incomparable vector spaces. The inner product between a text vector and an image vector from different model families has no semantic meaning.
The downstream consequence is operational pain. Every query becomes three queries plus a merge. Every cross-modal question — "what visual concept does this brief describe?" — requires either re-encoding into a shared space at query time, or hand-rolled heuristics for merging incomparable scores.
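For concreteness, here is a minimal sketch of that merge step, with hypothetical encoder and index objects (the .encode / .search method names and the rank-interleave heuristic are assumptions for illustration, not any particular vendor's API):

# Hypothetical multi-bucket retrieval: one encoder and one index per
# modality, merged by per-bucket rank because raw scores from different
# model families live in incomparable spaces.
def multi_bucket_search(query: str, stores: dict, k: int = 10) -> list:
    merged = []
    for modality, (encoder, index) in stores.items():
        vec = encoder.encode(query)                    # one encode per modality
        for rank, hit in enumerate(index.search(vec, k=k)):
            merged.append((rank, modality, hit))       # merge on rank, not score
    merged.sort(key=lambda t: t[0])                    # interleave by rank
    return merged[:k]

Rank interleaving is about the best you can do here, and it is still a guess: it assumes every bucket's top hit is equally relevant, which nothing guarantees.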
The Living Context Engine architecture assumes you can do better. Specifically: that a single, well-trained multimodal embedding model can produce a vector space where text, image, and video coexist with meaningful cross-modal similarity. In 2026, that assumption is finally realistic.
What Gemini Embedding 2 Changed
Google's gemini-embedding-exp-03-07 outputs a 768-dimensional vector for text, image, and video inputs from the same encoder. The training objective explicitly co-locates conceptually aligned content across modalities — a photo of a sunset and the phrase "warm golden hour palette" land near each other in vector space.
This changes the storage architecture. You no longer need three indexes. You need one index over three modalities, with a modality tag on each node for filtering when you want single-modality results.
The Feather DB Storage Model
Each node in a Feather DB unified-modality store carries:
{
  "id": "node_a3f9",
  "vector": [0.012, -0.084, ...],          # 768 floats
  "modality": "image" | "text" | "video",  # one tag per node
  "payload": { ... },                      # arbitrary metadata
  "edges": [ ... ],                        # typed graph edges
  "decay": { "inserted_at": ..., "recall_count": ..., ... }
}
The HNSW index is dimension-aware but modality-agnostic. The graph is fully cross-modal — a node of modality "text" can link to a node of modality "image" via a typed edge.
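As a sketch of the write path under that model (the insert and link method names, the encoder.embed wrapper, and the illustrated_by edge type are all hypothetical; only the search call shown later appears in this piece):

# Hypothetical write path: one index, three modalities, typed edges.
# encoder.embed(...) stands in for Gemini Embedding 2 and is assumed to
# return one 768-float vector for text, image, or video input alike.
brief_id = db.insert(
    vector=encoder.embed(text="Warm golden hour palette, quiet luxury."),
    modality="text",
    payload={"kind": "strategy_brief"},
)
hero_id = db.insert(
    vector=encoder.embed(image=open("hero.jpg", "rb").read()),
    modality="image",
    payload={"kind": "hero_image"},
)
db.link(src=brief_id, dst=hero_id, edge_type="illustrated_by")  # text -> image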
Four Queries That Become Native
The architectural payoff is not just storage simplification. Four query patterns that were previously awkward or impossible become first-class:
1. Cross-Modal Search
"Find images that look like this brief describes." A text query vector retrieves visually-aligned image nodes directly. No re-encoding. No multi-index merge.
results = db.search(brief_text_vec, k=10, modality="image")  # text vector in, image hits out
2. Same-Ad Coherence
For a single piece of creative — script, hero image, video cut — store all three modalities as separate nodes linked by a variant_of edge. Coherence becomes a measurable scalar: the mean pairwise similarity inside the variant set. Below a threshold, the creative is incoherent and worth flagging.
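A minimal sketch of that scalar, assuming the vectors are L2-normalized so the dot product is cosine similarity; the 0.55 threshold is an illustrative placeholder, not a tuned value:

import numpy as np
from itertools import combinations

def coherence(vectors) -> float:
    """Mean pairwise cosine similarity across one creative's variant set."""
    return float(np.mean([u @ v for u, v in combinations(vectors, 2)]))

# Stand-ins for the script / hero-image / video-cut node vectors
# fetched via the creative's variant_of edges.
rng = np.random.default_rng(0)
variant = [v / np.linalg.norm(v) for v in rng.normal(size=(3, 768))]

if coherence(variant) < 0.55:  # illustrative threshold, tune per workload
    print("incoherent creative: flag for review")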
3. Competitor Threat Detection
Index your strategy briefs (text) alongside competitor creative (text + image + video). A new competitor asset becomes a node. The threat score is its similarity to your own strategy. High similarity = the competitor is encroaching on your position. Low similarity = independent move.
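A sketch of that score (the .score attribute on hits and the max-over-top-k aggregation are assumptions; averaging the top few similarities is an equally reasonable choice):

def threat_score(db, asset_vec, k: int = 5) -> float:
    """Similarity of a new competitor asset to our own strategy briefs.

    The unified index lets an image or video vector query text nodes
    directly; assumes each hit carries a cosine-similarity .score.
    """
    hits = db.search(asset_vec, k=k, modality="text")
    return max(h.score for h in hits)  # the nearest brief dominates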
4. Visual-to-Strategy Traversal
A new image asset is encoded and used as the query. The top-k results return the strategy briefs (text) most aligned with that visual concept, followed by a graph hop to the campaigns derived from those briefs. A single call, two modalities, one connected subgraph back.
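As a sketch (the db.neighbors edge walk and the derived_from edge type are hypothetical; db.search is the same call used above):

# New image asset -> nearest strategy briefs -> their campaigns.
image_vec = encoder.embed(image=asset_bytes)           # same 768-dim space
briefs = db.search(image_vec, k=5, modality="text")    # cross-modal hop
campaigns = [
    node
    for brief in briefs
    for node in db.neighbors(brief.id, edge_type="derived_from")  # graph hop
]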
What Gets Tougher
Two honest costs of going unified:
- You commit to one encoder. A unified index is only as good as the embedding model. Mixing model families breaks the assumption — never co-mingle Gemini Embedding 2 vectors with OpenAI Ada vectors in the same index.
- Single-modality recall may be marginally worse. A dedicated text encoder fine-tuned for retrieval will usually beat a multimodal encoder on a pure-text benchmark by a few percent. The architectural question is whether that gap matters more than the new cross-modal queries you can now do. For most context-engine workloads, it doesn't.
The Architectural Bet
Storing all modalities in one 768-dimensional space is a bet that modality boundaries inside an AI system are mostly artificial — products of separate model lineages, not separate conceptual spaces. As multimodal encoders continue to improve, that bet looks more and more correct. Feather DB is built around the assumption that the bet pays off — and the architecture makes the cross-modal queries that follow feel native, not bolted on.
Part of the architecture series. Next up: the Living Context Engine series.