We ran a reproducible eval comparing pre-compiled Q&A artifacts against naive chunk retrieval. Here is what the data actually shows — including where the approach breaks.
Retrieval-Augmented Generation has become the default architecture for building knowledge-intensive applications. The textbook RAG pipeline looks like this:
The problem is step 3. Every query pays LLM inference cost to turn unstructured chunks into an answer. If your chunks are 500 characters of semi-structured text, the LLM spends non-trivial tokens reasoning about which parts are relevant, how to combine them, and what the answer actually is.
Worse, chunk retrieval is undirected. Given a question like "how does hybrid retrieval work?", a naive chunk search returns ranked text — not answers. The synthesis step has to do all the work of extracting and composing a response from whatever the retrieval step pulled back.
The industry has responded with better chunking strategies, rerankers, and hybrid retrieval — all focused on improving step 2. But the fundamental architecture remains: every query pays full synthesis cost.
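The three steps referred to above can be sketched as follows. This is a minimal illustration rather than the eval's code: it assumes step 1 is fixed-size chunking, `embed` is a toy bag-of-words stand-in for a real embedding model, and `synthesize` is a hypothetical stub that only marks where the per-query LLM call would happen.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector. Assumption: stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, size: int = 500) -> list[str]:
    """Step 1: split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Step 2: rank chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def synthesize(query: str, context: list[str]) -> str:
    """Step 3: hypothetical stub for the LLM call that every query pays for."""
    context_text = "\n".join(context)
    prompt = f"Answer using only this context:\n{context_text}\n\nQuestion: {query}"
    return f"[LLM answer generated from a {len(prompt)}-char prompt]"


docs = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Vector search compares dense embeddings of queries and documents.",
]
chunks = [c for doc in docs for c in chunk(doc)]
top = retrieve("how does vector search work", chunks, k=1)
answer = synthesize("how does vector search work", top)
```

Steps 1 and 2 are cheap and local. Step 3 is the only one that costs LLM tokens, and it runs once per query.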
An alternative approach: move synthesis upstream.
Instead of retrieving chunks and synthesizing at query time, you pre-compile documents into structured knowledge artifacts at index time. These artifacts are answers to anticipated questions — Q&A pairs, entity extractions, summaries — packaged in a form the LLM doesn't need to synthesize from raw text.
At query time, retrieval becomes: find the right artifact, return it directly. No synthesis step required.
This is the thesis that Pinecone Nexus calls "knowledge compilation" and the RAG community sometimes calls "inverse RAG." The core claim: you pay the synthesis cost once, at index time, and never again at query time.
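A minimal sketch of the idea, under stated assumptions: `Artifact`, `compile_artifacts`, `lookup`, and `stub_synthesize_qa` are hypothetical names, the toy `embed` replaces a real embedding model, and matching on the anticipated question text (rather than on the answer or both) is an assumption, not something the eval or Nexus is known to do.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Artifact:
    question: str  # anticipated question, generated at index time
    answer: str    # pre-synthesized answer, returned verbatim at query time
    source: str    # document the artifact was compiled from


def embed(text: str) -> Counter:
    """Toy bag-of-words vector. Assumption: stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def stub_synthesize_qa(text: str) -> list[tuple[str, str]]:
    """Hypothetical stand-in for the index-time LLM that writes Q&A pairs."""
    if "BM25" in text:
        return [("What does BM25 rank by?",
                 "Term frequency and inverse document frequency.")]
    return [("When does vector search help?",
             "When the right document uses different words than the query.")]


def compile_artifacts(doc_id: str, text: str, synthesize_qa) -> list[Artifact]:
    """Index time: pay the synthesis cost once per document."""
    return [Artifact(q, a, doc_id) for q, a in synthesize_qa(text)]


def lookup(query: str, artifacts: list[Artifact]) -> tuple[Artifact, float]:
    """Query time: nearest anticipated question wins. No LLM call."""
    q = embed(query)
    scored = [(art, cosine(q, embed(art.question))) for art in artifacts]
    return max(scored, key=lambda pair: pair[1])


store = (
    compile_artifacts("bm25.md", "BM25 scores terms by frequency.", stub_synthesize_qa)
    + compile_artifacts("vector.md", "Vector search uses embeddings.", stub_synthesize_qa)
)
best, score = lookup("when does vector search help", store)
```

The returned `score` matters as much as the answer: it is the only signal at query time that the question may not have been anticipated.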
The tradeoffs are the interesting part.
We built a reproducible evaluation on a real corpus of 265 documents from a technical course on search, retrieval, and ranking. Topics include BM25, vector search, hybrid retrieval, tokenization, and information retrieval fundamentals.
Two retrieval paths were evaluated side-by-side:
| | Path A: Naive RAG | Path B: Pre-Compiled Artifacts |
|---|---|---|
| Retrieval unit | 500-char text chunks | Q&A pairs synthesized from docs |
| Index size | 398 chunks | 665 Q&A artifacts |
| Query-time synthesis | Required (LLM call) | None |
| Embedding model | bge-small-en-v1.5 (384-dim, local) | bge-small-en-v1.5 |
| Synthesis engine | n/a | MiniMax abab6.5s-chat |
| Artifacts per doc | n/a | Up to 3 Q&A pairs per readable doc (665 / 264 ≈ 2.5 on average) |
Queries: 10 hand-written questions covering vector search, BM25, hybrid retrieval, tokenization effects, and answer quality monitoring. These are representative of the types of questions the corpus was designed to answer.
Metrics measured:
What we did NOT measure (honest limitations):
Here is what the build process cost:
    === Indexing 265 documents (batch LLM synthesis) ===
    Batch size: 8 docs/call
    [Checkpoint] 40/265 docs | ETA: 16min | LLM: 2.8min
    [Checkpoint] 80/265 docs | ETA: 14min | LLM: 6.3min
    [Checkpoint] 120/265 docs | ETA: 11min | LLM: 9.0min
    [Checkpoint] 160/265 docs | ETA: 9min | LLM: 13.1min
    [Checkpoint] 200/265 docs | ETA: 5min | LLM: 15.6min
    [Checkpoint] 240/265 docs | ETA: 2min | LLM: 18.5min
    === Build Complete ===
    Total time: 1367s (22.8 min)
    LLM time: 20.2 min total | 4574ms/doc
    Chunks: 398
    Artifacts: 665 Q&A pairs / 264 readable docs
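The figures in the log can be cross-checked with a few lines of arithmetic. Every input below is copied from the log. The derived call count assumes each LLM call took a full batch of 8 documents, which the log implies but does not state.

```python
import math

# Values copied from the build log
docs_total = 265
readable_docs = 264
artifacts = 665
batch_size = 8
total_seconds = 1367
llm_minutes = 20.2

# Derived values
llm_calls = math.ceil(docs_total / batch_size)         # assumes full batches
llm_ms_per_doc = llm_minutes * 60 * 1000 / docs_total  # matches the logged ms/doc
artifacts_per_doc = artifacts / readable_docs          # average Q&A pairs per readable doc
non_llm_seconds = total_seconds - llm_minutes * 60     # chunking, embedding, I/O
llm_share = (llm_minutes * 60) / total_seconds         # fraction of build time spent in the LLM

print(f"{llm_calls} LLM calls, {llm_ms_per_doc:.0f} ms/doc, "
      f"{artifacts_per_doc:.2f} artifacts/doc, {llm_share:.0%} of build time in LLM")
```

Two things stand out: LLM synthesis accounts for most of the build time, and the average is about 2.5 artifacts per readable document rather than a flat 3.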
Key observations:
Before showing results, the most important concept to understand is coverage.
Pre-compiled artifacts can only answer questions that were anticipated when the artifacts were generated. If nobody thought to ask "how does the M1 tokenizer lowercasing affect phrase queries in M5?" during the build phase, that artifact doesn't exist.
Naive chunk retrieval has no such limitation. Given any question, the chunks are retrieved and synthesis happens at query time — with full flexibility to combine and reason across whatever was retrieved.
This is the central tradeoff:
| | Pre-Compiled Artifacts | Naive Chunks |
|---|---|---|
| Query-time cost | Near-zero (retrieval only) | LLM synthesis required |
| Novel questions | Fails — no artifact exists | Synthesizes on the fly, quality variable |
| Cross-doc reasoning | Requires explicit cross-doc artifacts | Works — combines chunks |
| Answer quality on known Qs | High — artifact IS the answer | Variable — depends on synthesis |
| Build cost | High (one-time) | Zero |
Coverage is not a binary problem. A well-designed artifact generation strategy — with 5–10 questions per doc, covering multiple question types — can capture a large fraction of real-world queries. But the long tail of questions will always be better served by naive retrieval.
Path B required 0 LLM tokens per query to return answers. For Path A we estimate roughly 800 tokens per query for synthesis. This is a naive estimate based on typical RAG prompt length, not a measured value.
At scale (1,000 queries/day), that's 800K tokens/day saved — real money if using GPT-4o or Claude Sonnet, and meaningful even with cheaper models.
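The scaling arithmetic is simple enough to parameterize. In the sketch below, the 800 tokens per query is the estimate above, while `usd_per_million_tokens` is a hypothetical placeholder to be replaced with your provider's actual price. No real pricing is assumed.

```python
def daily_synthesis_tokens(queries_per_day: int, tokens_per_query: int = 800) -> int:
    """Tokens spent on query-time synthesis per day (Path A only)."""
    return queries_per_day * tokens_per_query


def daily_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a given (hypothetical) price."""
    return tokens / 1_000_000 * usd_per_million_tokens


tokens = daily_synthesis_tokens(1_000)  # 800,000 tokens/day, as in the text
cost = daily_cost_usd(tokens, usd_per_million_tokens=5.0)  # 5.0 is a placeholder price
```

Because the cost is linear in both query volume and price, a 10x increase in either one produces a 10x increase in what Path B avoids paying.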
Both paths retrieved the same document (aqn.md). The difference is what they returned:
Path A (naive chunks): Retrieved the raw document. The LLM would synthesize an answer at query time — additional latency and token cost.
Path B (pre-compiled artifacts): Returned five specific comparative findings directly:
1. "Keyword alerts fired 4-6x more than AQ alerts. AQ fired only when the answer set genuinely changed, while keyword alerts overfired."
2. "The arXiv simulation found that keyword alerts (from Google Scholar, Feedly, SignalHub) overfire by 4-6x in noisy academic literature."
3. "Answer Quality Notification significantly outperformed keyword-based alerts by reducing false positives. In sq2 (dense retrieval vector search ranking), keyword alerts fired on 6 irrelevant results."
4. "AQN fires only when a standing query's answer set genuinely changes, unlike keyword alerts which fire on keyword presence."
5. "When both keyword and Answer Quality alerts agree on the same result, the signal is particularly strong."
These are structured, specific, comparative claims. A query-time synthesis step would produce something similar — but would pay for it on every query.
Path A surfaced priya-m7-feedback-hybrid-retrieval.md (focused on hybrid retrieval). Path B surfaced vikram-m6-feedback-vector-search.md (focused on vector search).
Both are valid angles on the question. Path B's artifact explicitly addresses the vector vs keyword distinction with concrete examples:
"Vector search handles semantic matching when the right document uses completely different words than the query."
Path A would synthesize this — but the raw chunk from the hybrid retrieval doc may or may not contain this framing directly.
Both paths retrieved the same top document. Path B returned specific, pre-digested findings as artifacts. For eq7 ("What did the student profiles find about Priya vs Rahul completion likelihood?"), Path B's artifacts directly returned:
1. "Rahul and Anjali both showed the highest completion likelihood (5/5). Both have CS-fluent backgrounds with strong technical foundations."
2. "Rahul is a fast reader and fast implementer who reads the problem, starts coding, and checks tests immediately."
3. "Priya wants to pivot from marketing into ML/AI engineering... She reads every word of instructions including disclaimers and footnotes, and reads all test comments before attempting implementation."
Path A would require the LLM to extract these facts from the raw chunks — and given the same document, might well produce a similar answer. The difference is Path B pays zero LLM tokens at query time to do it.
With local embeddings (sentence-transformers, no API call), retrieval latency was nearly identical for both paths after the first query. The first query incurs a ~10-second model loading penalty; subsequent queries run at ~35ms regardless of path.
| Query | Path A | Path B |
|---|---|---|
| eq2 (vector vs BM25) | 36 ms | 36 ms |
| eq3 (student sim) | 34 ms | 33 ms |
| eq4 (hybrid + RRF) | 34 ms | 34 ms |
| eq5 (naive RAG) | 34 ms | 34 ms |
| eq6 (AQN vs keyword) | 38 ms | 38 ms |
| eq7 (student profiles) | 36 ms | 35 ms |
| eq8 (M1→M3/M4) | 36 ms | 36 ms |
| eq9 (AQN findings) | 35 ms | 35 ms |
| eq10 (why RAG fails) | 36 ms | 35 ms |
These timings cover retrieval only. Both paths embed the query with the same local model and hit the same local vector lookup, so they are indistinguishable here. A latency advantage for pre-compilation only materializes end to end, once Path A's synthesis call (typically a remote LLM) is counted. In this setup the real advantage is token cost, not raw latency.
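A timing harness along these lines would reproduce the measurement. It is a sketch, not the eval's code: `time_queries` is a hypothetical helper, and it assumes the first query is treated as warm-up and excluded because of the one-time model load.

```python
import statistics
import time
from typing import Callable


def time_queries(search: Callable[[str], object], queries: list[str], warmup: int = 1) -> dict:
    """Time each call to `search`, skipping the first `warmup` queries."""
    if len(queries) <= warmup:
        raise ValueError("need more queries than warm-up runs")
    timings_ms = []
    for i, query in enumerate(queries):
        start = time.perf_counter()
        search(query)
        elapsed_ms = (time.perf_counter() - start) * 1000
        if i >= warmup:  # drop warm-up runs (e.g. model loading)
            timings_ms.append(elapsed_ms)
    return {
        "n": len(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


# Usage with a stand-in search function
result = time_queries(lambda q: q.lower(), [f"query {i}" for i in range(10)])
```

Reporting the median rather than the mean keeps a single slow outlier from distorting the comparison between paths.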
Path A retrieved priya-m3-feedback-bm25.md. Path B retrieved vikram-m3-feedback-bm25.md.
Both are student feedback documents on BM25. They cover similar material but with different emphasis. This isn't a failure — it's the system working as designed. Different synthesis prompts during artifact generation produced artifacts that matched different angles of the same question.
This query is most revealing. It's a cross-document question — M1 (the first module) affects M3 and M4 (later modules). This requires reasoning across module boundaries.
Path A retrieved indexzero-m0-search-is-ranking-not-lookup.md — a foundational module doc. Path B retrieved anjali-m2-feedback-inverted-index.md — an M2 feedback doc.
Path B's artifact answers do mention cross-module effects:
"The module warned that if the M1 tokenizer lowercased terms (like 'RAM' to 'ram') but query processing used uppercase ('RAM'), the lookup would fail silently because exact dictionary matching wouldn't work."
"M2 skips storing term positions, stating they're only needed for phrase queries not covered in that module. However, M5 (query processing) DOES need positions for phrase and proximity matching, so it re-creates them."
These are genuine cross-module insights. But they're partial — the M1→M4 chain is only partially covered. A truly rigorous answer would trace the full M1→M2→M3→M4 dependency chain explicitly, which neither path delivers without an explicit cross-document synthesis artifact.
This is where pre-compilation's coverage ceiling is most visible.
Be honest about what this eval does and doesn't prove:
Pinecone announced Nexus in May 2026, positioning it as a "knowledge engine" that compiles knowledge at index time and serves structured answers at query time. That positioning matches the core thesis we tested.
What Nexus adds over our eval:
What our eval adds over Nexus marketing:
The key unresolved question for Nexus specifically: how does it handle cross-document synthesis at scale, and how does it manage artifact freshness as documents update? Nexus addresses incremental rebuilds, but the coverage problem — generating the right artifacts for the right questions — remains a design challenge regardless of infrastructure.
This is not a binary choice. The decision depends on your query volume, corpus stability, and question predictability.
| Scenario | Recommended approach | Reason |
|---|---|---|
| Bounded, known query surface (product docs, course content) | Pre-compiled artifacts | Questions are predictable; build cost amortizes |
| High query volume (>10K queries/day) | Pre-compiled artifacts | Token savings compound at scale |
| Frequently updating corpus (daily news, changelogs) | Hybrid: artifacts + naive fallback | Artifact staleness becomes a problem |
| Novel, creative, or exploratory questions | Naive chunks + synthesis | Can't anticipate all question types |
| Cross-document reasoning required | Explicit cross-doc artifacts OR naive RAG | Requires deliberate artifact design |
| Small corpus + low query volume | Naive RAG | Build cost not worth it |
| Agentic workflows with stateful context | Pre-compiled + answer quality notification | Agents benefit from pre-digested knowledge |
The hybrid pattern is probably the right default for production systems: pre-compiled artifacts as the primary path, with a fallback to naive chunk retrieval when artifact similarity is below a threshold, or when queries explicitly cross corpus boundaries.
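The fallback rule can be sketched as a small router. The names `route` and `Routed` and the 0.75 threshold are illustrative assumptions. The eval did not tune a threshold, and the two callables stand in for whatever artifact lookup and naive RAG pipeline you already have.

```python
from typing import Callable, NamedTuple


class Routed(NamedTuple):
    answer: str
    path: str     # "artifact" or "naive_fallback"
    score: float  # similarity of the best artifact match


def route(
    query: str,
    artifact_lookup: Callable[[str], tuple[str, float]],
    naive_rag: Callable[[str], str],
    threshold: float = 0.75,  # placeholder value, needs tuning per corpus
) -> Routed:
    """Serve the pre-compiled answer when the match is strong, else synthesize."""
    artifact_answer, score = artifact_lookup(query)
    if score >= threshold:
        return Routed(artifact_answer, "artifact", score)
    return Routed(naive_rag(query), "naive_fallback", score)


# Usage with stand-in callables
def stub_lookup(query: str) -> tuple[str, float]:
    return ("pre-compiled answer", 0.9) if "bm25" in query.lower() else ("weak match", 0.3)


def stub_rag(query: str) -> str:
    return "answer synthesized at query time"


hit = route("How does BM25 score terms?", stub_lookup, stub_rag)
miss = route("Something nobody anticipated", stub_lookup, stub_rag)
```

Logging `path` and `score` for every query gives you the data needed to tune the threshold and to see which unanticipated questions deserve new artifacts.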
This eval raises more questions than it answers. The three most important:
Artifact density. Is 3 Q&A pairs per document the right density? Too few → coverage gaps. Too many → noise and retrieval dilution. We didn't test what ratio actually maximizes coverage without degrading signal.
Fallback strategy. What's the right trigger for switching from pre-compiled artifacts to naive chunk retrieval? Low vector similarity is a proxy, but it doesn't tell you whether the question is answerable by either path. Getting this wrong means either losing the artifact benefit or returning worse answers.
Build cost crossover. At what query volume does a one-time build cost become worth it vs paying synthesis cost per query? This depends on LLM pricing, corpus update frequency, and how often the answer changes. We didn't measure this — it's corpus- and pricing-dependent.
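The crossover itself is a one-line formula once the inputs are known. The sketch below assumes the artifact path costs zero LLM tokens at query time (as in this eval) and that embedding cost is the same on both paths and so cancels out. The dollar figures in the usage lines are placeholders, not measurements.

```python
def break_even_queries(build_cost_usd: float, synthesis_cost_per_query_usd: float) -> float:
    """Queries needed before a one-time build is cheaper than per-query synthesis."""
    return build_cost_usd / synthesis_cost_per_query_usd


def break_even_queries_per_day(
    build_cost_usd: float,
    synthesis_cost_per_query_usd: float,
    rebuild_interval_days: float,
) -> float:
    """Daily query volume needed when the corpus is rebuilt every N days."""
    return build_cost_usd / (rebuild_interval_days * synthesis_cost_per_query_usd)


# Placeholder inputs: a $2.00 build and $0.004 of synthesis per query
one_time = break_even_queries(2.0, 0.004)
weekly_rebuild = break_even_queries_per_day(2.0, 0.004, rebuild_interval_days=7)
```

Shorter rebuild intervals raise the daily volume needed to break even, which is why corpus update frequency matters as much as LLM pricing.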
The honest answer: we don't know the optimal artifact design. This was a first data point, not a definitive answer.
The evaluation was implemented in Python with the following stack:
- sentence-transformers (bge-small-en-v1.5, 384-dim)
- nexus_eval.py (~265 lines, reproducible)

Build log and full eval results (JSON) are available on request.