Research Notes
Honest writeups on AI systems

Pre-Compiled Knowledge Artifacts vs Naive RAG: An Empirical Evaluation

We ran a reproducible eval comparing pre-compiled Q&A artifacts against naive chunk retrieval. Here is what the data actually shows — including where the approach breaks.

1. The Problem with RAG

Retrieval-Augmented Generation has become the default architecture for building knowledge-intensive applications. The textbook RAG pipeline looks like this:

  1. Ingest documents into a vector database (chunked, embedded)
  2. Retrieve chunks at query time using cosine similarity
  3. Synthesize an answer by feeding the top-k chunks to an LLM

The problem is step 3. Every query pays LLM inference cost to turn unstructured chunks into an answer. If your chunks are 500 characters of semi-structured text, the LLM spends non-trivial tokens reasoning about which parts are relevant, how to combine them, and what the answer actually is.
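To make the three steps concrete, here is a minimal sketch of the textbook pipeline. It is not the eval's code: `embed` is a toy bag-of-words stand-in for a real embedding model, the sample chunks are invented for illustration, and the LLM call itself is left out. The point is only to show the prompt that step 3 has to send on every query.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 2: rank chunks by cosine similarity to the query, keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, top_chunks: list[str]) -> str:
    """Step 3 input: every query ships these chunks to the LLM for synthesis."""
    context = "\n---\n".join(top_chunks)
    return f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {query}"

# Step 1 (ingest) is reduced to a hard-coded list of illustrative chunks.
chunks = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Hybrid retrieval merges BM25 and vector search rankings.",
    "Tokenization lowercases terms before indexing.",
]
query = "how does hybrid retrieval work?"
top = retrieve(query, chunks)
prompt = build_prompt(query, top)
print(prompt)
```

The retrieval step returns ranked text, and everything after `build_prompt` is paid for in LLM tokens on each query.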

Worse, chunk retrieval is undirected. Given a question like "how does hybrid retrieval work?", a naive chunk search returns ranked text — not answers. The synthesis step has to do all the work of extracting and composing a response from whatever the retrieval step pulled back.

The industry has responded with better chunking strategies, rerankers, and hybrid retrieval — all focused on improving step 2. But the fundamental architecture remains: every query pays full synthesis cost.

2. Compile-Time Knowledge Shaping

An alternative approach: move synthesis upstream.

Instead of retrieving chunks and synthesizing at query time, you pre-compile documents into structured knowledge artifacts at index time. These artifacts are answers to anticipated questions (Q&A pairs, entity extractions, summaries), packaged so that no LLM has to synthesize them from raw text at query time.

At query time, retrieval becomes: find the right artifact, return it directly. No synthesis step required.
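A minimal sketch of that split between index time and query time, under loud assumptions: `fake_synthesize` is a hypothetical stand-in for the LLM call that writes Q&A pairs, `similarity` is a toy word-overlap score standing in for embedding similarity, and the two documents are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Artifact:
    question: str
    answer: str
    source_doc: str

def tokens(text: str) -> set[str]:
    return set(text.lower().replace("?", " ").split())

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a toy stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def compile_artifacts(
    docs: dict[str, str],
    synthesize: Callable[[str], list[tuple[str, str]]],
) -> list[Artifact]:
    """Index time: pay the synthesis cost once per document."""
    return [Artifact(q, a, doc_id)
            for doc_id, text in docs.items()
            for q, a in synthesize(text)]

def lookup(query: str, artifacts: list[Artifact]) -> Artifact:
    """Query time: return the artifact whose question is closest; no LLM call."""
    return max(artifacts, key=lambda art: similarity(query, art.question))

def fake_synthesize(text: str) -> list[tuple[str, str]]:
    # Hypothetical stand-in for the LLM: one canned Q&A pair per document.
    topic = text.split(" ", 1)[0]
    return [(f"How does {topic} work?", text)]

docs = {
    "bm25.md": "BM25 scores documents by term frequency and inverse document frequency.",
    "hybrid.md": "Hybrid retrieval merges BM25 and vector search rankings.",
}
artifacts = compile_artifacts(docs, fake_synthesize)
best = lookup("how does hybrid retrieval work?", artifacts)
print(best.source_doc, "->", best.answer)
```

Note that `lookup` always returns something, even when no anticipated question is close to the query.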

This is the thesis that Pinecone Nexus calls "knowledge compilation" and the RAG community sometimes calls "inverse RAG." The core claim: you pay the synthesis cost once, at index time, and never again at query time.

The tradeoffs are the interesting part.

3. Methodology

We built a reproducible evaluation on a real corpus of 265 documents. The corpus comes from a technical course on search, retrieval, and ranking, with topics such as BM25, vector search, hybrid retrieval, tokenization, and information retrieval fundamentals.

Two retrieval paths were evaluated side-by-side:

| | Path A: Naive RAG | Path B: Pre-Compiled Artifacts |
|---|---|---|
| Retrieval unit | 500-char text chunks | Q&A pairs synthesized from docs |
| Index size | 398 chunks | 665 Q&A artifacts |
| Query-time synthesis | Required (LLM call) | None |
| Embedding model | bge-small-en-v1.5 (384-dim, local) | bge-small-en-v1.5 |
| Synthesis engine | | MiniMax abab6.5s-chat |
| Artifacts per doc | | 3 Q&A pairs per readable doc |
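For reference, Path A's retrieval unit (fixed 500-character splits with no overlap and no awareness of sentence or header boundaries, as described under Limitations) can be sketched in a few lines. The function name `chunk_text` is our own for illustration, not taken from the eval code.

```python
def chunk_text(text: str, size: int = 500) -> list[str]:
    """Split text into consecutive fixed-size pieces with no overlap.

    Boundaries fall wherever the character count says, so a chunk can
    start or end mid-sentence or mid-word.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]

sample = "x" * 1200
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [500, 500, 200]
```

This is deliberately the weakest reasonable baseline; overlap or header-aware splitting would change what Path A retrieves.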

Queries: 10 hand-written questions covering vector search, BM25, hybrid retrieval, tokenization effects, and answer quality monitoring. These are representative of the types of questions the corpus was designed to answer.

Metrics measured:

What we did NOT measure (honest limitations):

4. Build Economics

Here is what the build process cost:

=== Indexing 265 documents (batch LLM synthesis) ===
Batch size: 8 docs/call

  [Checkpoint]  40/265 docs | ETA: 16min | LLM:  2.8min
  [Checkpoint]  80/265 docs | ETA: 14min | LLM:  6.3min
  [Checkpoint] 120/265 docs | ETA: 11min | LLM:  9.0min
  [Checkpoint] 160/265 docs | ETA:  9min | LLM: 13.1min
  [Checkpoint] 200/265 docs | ETA:  5min | LLM: 15.6min
  [Checkpoint] 240/265 docs | ETA:  2min | LLM: 18.5min

=== Build Complete ===
  Total time:  1367s (22.8 min)
  LLM time:    20.2 min total | 4574ms/doc
  Chunks:      398
  Artifacts:   665 Q&A pairs / 264 readable docs
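
The summary figures in the build log can be cross-checked against each other, and a couple of derived numbers fall out of them. The snippet below uses only values printed in the log; the variable names are ours.

```python
total_s = 1367          # total build time, seconds
llm_min = 20.2          # time spent in LLM calls, minutes
docs = 265              # documents indexed
readable_docs = 264     # documents that produced artifacts
artifacts = 665         # Q&A pairs produced

total_min = total_s / 60                       # about 22.8 min, matches the log
llm_ms_per_doc = llm_min * 60_000 / docs       # about 4574 ms, matches the log
artifacts_per_doc = artifacts / readable_docs  # about 2.52 per readable doc
llm_share = llm_min / total_min                # fraction of build time in LLM calls

print(f"{total_min:.1f} min total, {llm_ms_per_doc:.0f} ms/doc, "
      f"{artifacts_per_doc:.2f} artifacts/doc, LLM share {llm_share:.0%}")
```

Two things stand out: LLM calls account for most of the build time, and the realized density is about 2.5 artifacts per readable document rather than exactly 3.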

Key observations:

5. Coverage: The Central Tension

Before showing results, the most important concept to understand is coverage.

Pre-compiled artifacts can only answer questions that were anticipated when the artifacts were generated. If nobody thought to ask "how does the M1 tokenizer lowercasing affect phrase queries in M5?" during the build phase, that artifact doesn't exist.

Naive chunk retrieval has no such limitation. Given any question, the chunks are retrieved and synthesis happens at query time — with full flexibility to combine and reason across whatever was retrieved.

This is the central tradeoff:

| | Pre-Compiled Artifacts | Naive Chunks |
|---|---|---|
| Query-time cost | Near-zero (retrieval only) | LLM synthesis required |
| Novel questions | Fails (no artifact exists) | Synthesizes on the fly, quality variable |
| Cross-doc reasoning | Requires explicit cross-doc artifacts | Works (combines chunks) |
| Answer quality on known Qs | High (artifact IS the answer) | Variable (depends on synthesis) |
| Build cost | High (one-time) | Zero |

Coverage is not a binary problem. A well-designed artifact generation strategy — with 5–10 questions per doc, covering multiple question types — can capture a large fraction of real-world queries. But the long tail of questions will always be better served by naive retrieval.

6. Results: Where Pre-Compilation Helped

Token savings at query time

Path B required 0 LLM tokens per query to return answers. Path A required an estimated 800 tokens per query for synthesis (a rough estimate based on typical RAG prompt length, not a measured value).

At scale (1,000 queries/day), that's 800K tokens/day saved — real money if using GPT-4o or Claude Sonnet, and meaningful even with cheaper models.
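The arithmetic behind that figure, as a small sketch. The 800 tokens per query is the estimate from above; the price used in the example is a hypothetical placeholder, not a quote for any specific model.

```python
def daily_token_savings(queries_per_day: int,
                        synthesis_tokens_per_query: int = 800) -> int:
    """Tokens Path A would spend on synthesis that Path B does not."""
    return queries_per_day * synthesis_tokens_per_query

def daily_cost_savings(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens

saved = daily_token_savings(1_000)
print(saved)  # 800000

# Hypothetical flat price of 5 USD per million tokens, for illustration only.
print(daily_cost_savings(saved, usd_per_million_tokens=5.0))
```

Real pricing usually differs for input and output tokens, so a flat rate is a simplification.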

eq6: "Compare the AQN approach to keyword alerts for research monitoring."

Both paths retrieved the same document (aqn.md). The difference is what they returned:

Path A (naive chunks): Retrieved the raw document. The LLM would synthesize an answer at query time — additional latency and token cost.

Path B (pre-compiled artifacts): Returned five specific comparative findings directly:

1. "Keyword alerts fired 4-6x more than AQ alerts. AQ fired only when the answer set genuinely changed, while keyword alerts overfired."

2. "The arXiv simulation found that keyword alerts (from Google Scholar, Feedly, SignalHub) overfire by 4-6x in noisy academic literature."

3. "Answer Quality Notification significantly outperformed keyword-based alerts by reducing false positives. In sq2 (dense retrieval vector search ranking), keyword alerts fired on 6 irrelevant results."

4. "AQN fires only when a standing query's answer set genuinely changes, unlike keyword alerts which fire on keyword presence."

5. "When both keyword and Answer Quality alerts agree on the same result, the signal is particularly strong."

These are structured, specific, comparative claims. A query-time synthesis step would produce something similar — but would pay for it on every query.

eq2: "What are the key differences between vector search and BM25?"

Path A surfaced priya-m7-feedback-hybrid-retrieval.md (focused on hybrid retrieval). Path B surfaced vikram-m6-feedback-vector-search.md (focused on vector search).

Both are valid angles on the question. Path B's artifact explicitly addresses the vector vs keyword distinction with concrete examples:

"Vector search handles semantic matching when the right document uses completely different words than the query."

Path A would synthesize this — but the raw chunk from the hybrid retrieval doc may or may not contain this framing directly.

eq3 and eq7: Student profile and completion likelihood queries

Both paths retrieved the same top document. Path B returned specific, pre-digested findings as artifacts. For eq7 ("What did the student profiles find about Priya vs Rahul completion likelihood?"), Path B's artifacts directly returned:

1. "Rahul and Anjali both showed the highest completion likelihood (5/5). Both have CS-fluent backgrounds with strong technical foundations."

2. "Rahul is a fast reader and fast implementer who reads the problem, starts coding, and checks tests immediately."

3. "Priya wants to pivot from marketing into ML/AI engineering... She reads every word of instructions including disclaimers and footnotes, and reads all test comments before attempting implementation."

Path A would require the LLM to extract these facts from the raw chunks — and given the same document, might well produce a similar answer. The difference is Path B pays zero LLM tokens at query time to do it.

7. Results: Where It Broke or Tied

Latency: effectively identical for local embeddings

With local embeddings (sentence-transformers, no API call), retrieval latency was nearly identical for both paths after the first query. The first query incurs a ~10-second model loading penalty; subsequent queries run at ~35ms regardless of path.

| Query | Path A | Path B |
|---|---|---|
| eq2 (vector vs BM25) | 36 ms | 36 ms |
| eq3 (student sim) | 34 ms | 33 ms |
| eq4 (hybrid + RRF) | 34 ms | 34 ms |
| eq5 (naive RAG) | 34 ms | 34 ms |
| eq6 (AQN vs keyword) | 38 ms | 38 ms |
| eq7 (student profiles) | 36 ms | 35 ms |
| eq8 (M1→M3/M4) | 36 ms | 36 ms |
| eq9 (AQN findings) | 35 ms | 35 ms |
| eq10 (why RAG fails) | 36 ms | 35 ms |

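These timings cover retrieval only. Both paths embed the query with the same local model and hit the same local vector lookup, so retrieval latency is effectively identical. Any end-to-end latency advantage for pre-compilation would come from skipping Path A's query-time synthesis call, which is not included in these figures. What the eval does show is a token-cost advantage, not a retrieval-latency advantage.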
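One way to take per-query timings so that a one-time cost such as model loading does not skew them is to run a warm-up call before measuring. This is a generic sketch, not the eval's harness; `search` is any callable that runs one retrieval, and `dummy_search` exists only to make the example runnable.

```python
import statistics
import time
from typing import Callable, Sequence

def time_queries(search: Callable[[str], object],
                 queries: Sequence[str],
                 warmup: int = 1) -> dict[str, float]:
    """Return per-query latency in milliseconds, measured after a warm-up."""
    for q in queries[:warmup]:
        search(q)  # absorb one-time costs such as model loading
    timings: dict[str, float] = {}
    for q in queries:
        start = time.perf_counter()
        search(q)
        timings[q] = (time.perf_counter() - start) * 1000.0
    return timings

def dummy_search(query: str) -> list[str]:
    # Stand-in for a real vector lookup.
    return sorted(query.split())

results = time_queries(dummy_search, ["vector vs BM25", "hybrid + RRF"])
print(f"median: {statistics.median(results.values()):.3f} ms")
```

Timing both paths through the same function keeps the comparison like for like.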

eq1: BM25 scoring — different documents, same topic

Path A retrieved priya-m3-feedback-bm25.md. Path B retrieved vikram-m3-feedback-bm25.md.

Both are student feedback documents on BM25. They cover similar material but with different emphasis. This isn't a failure — it's the system working as designed. Different synthesis prompts during artifact generation produced artifacts that matched different angles of the same question.

eq8: "How does M1 tokenization affect M3 and M4?" — the cross-document case

This query is the most revealing. It is a cross-document question: M1 (the first module) affects M3 and M4 (later modules), so answering it requires reasoning across module boundaries.

Path A retrieved indexzero-m0-search-is-ranking-not-lookup.md — a foundational module doc. Path B retrieved anjali-m2-feedback-inverted-index.md — an M2 feedback doc.

Path B's artifact answers do mention cross-module effects:

"The module warned that if the M1 tokenizer lowercased terms (like 'RAM' to 'ram') but query processing used uppercase ('RAM'), the lookup would fail silently because exact dictionary matching wouldn't work."

"M2 skips storing term positions, stating they're only needed for phrase queries not covered in that module. However, M5 (query processing) DOES need positions for phrase and proximity matching, so it re-creates them."

These are genuine cross-module insights, but they cover the M1→M4 chain only in part. A rigorous answer would trace the full M1→M2→M3→M4 dependency chain explicitly, which neither path delivers without an explicit cross-document synthesis artifact.

This is where pre-compilation's coverage ceiling is most visible.

8. Limitations

Be honest about what this eval does and doesn't prove:

  1. Small corpus (265 docs, 398 chunks, 665 artifacts). The behavior on a 100K+ doc corpus may differ significantly — retrieval dynamics change at scale, and coverage gaps become more pronounced.
  2. 10 queries is not enough. This is illustrative, not statistically significant. A real evaluation would need 50–100+ queries with ground-truth labels to make strong claims.
  3. No human answer quality evaluation. We measured what documents were retrieved and what artifacts were returned. We did not have human evaluators score the actual answer quality of Path A (synthesized) vs Path B (pre-compiled). Path A might produce better answers in some cases due to more flexible synthesis.
  4. Naive chunking is a weak baseline. We used 500-character chunk splits with no overlap and no structural awareness (no header-based chunking, no sentence boundary preservation). A production RAG system would likely use smarter chunking, which would improve Path A significantly.
  5. Artifact generation quality is not measured. We targeted 3 Q&A pairs per doc (the build produced 665 artifacts from 264 readable docs, about 2.5 per doc), but we didn't evaluate whether those were the right questions. Poor question design would produce artifacts that don't match real queries.
  6. Cross-document synthesis was not explicitly tested. Our corpus has strong cross-module dependencies (M1→M3/M4). We evaluated one query that touches this surface but didn't run a dedicated cross-doc artifact build.
  7. The eval doesn't measure build cost tradeoffs at scale. The crossover point — where build cost is amortized by query-time savings — depends on query volume, LLM pricing, and artifact quality.

9. Comparison to Pinecone Nexus

Pinecone announced Nexus in May 2026, positioning it as a "knowledge engine" that compiles knowledge at index time and serves structured answers at query time. This validates the core thesis we tested.

What Nexus adds over our eval:

What our eval adds over Nexus marketing:

The key unresolved question for Nexus specifically: how does it handle cross-document synthesis at scale, and how does it manage artifact freshness as documents update? Nexus addresses incremental rebuilds, but the coverage problem — generating the right artifacts for the right questions — remains a design challenge regardless of infrastructure.

10. When to Use Which Approach

This is not a binary choice. The decision depends on your query volume, corpus stability, and question predictability.

| Scenario | Recommended approach | Reason |
|---|---|---|
| Bounded, known query surface (product docs, course content) | Pre-compiled artifacts | Questions are predictable; build cost amortizes |
| High query volume (>10K queries/day) | Pre-compiled artifacts | Token savings compound at scale |
| Frequently updating corpus (daily news, changelogs) | Hybrid: artifacts + naive fallback | Artifact staleness becomes a problem |
| Novel, creative, or exploratory questions | Naive chunks + synthesis | Can't anticipate all question types |
| Cross-document reasoning required | Explicit cross-doc artifacts OR naive RAG | Requires deliberate artifact design |
| Small corpus + low query volume | Naive RAG | Build cost not worth it |
| Agentic workflows with stateful context | Pre-compiled + answer quality notification | Agents benefit from pre-digested knowledge |

The hybrid pattern is probably the right default for production systems: pre-compiled artifacts as the primary path, with a fallback to naive chunk retrieval when artifact similarity is below a threshold, or when queries explicitly cross corpus boundaries.
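
A minimal sketch of that routing logic, with everything injected so it stays independent of any particular vector store or LLM. The `Hit` type, the function names, and the 0.75 threshold are all assumptions for illustration; we did not tune or test a threshold in this eval.

```python
from typing import Callable, NamedTuple, Optional

class Hit(NamedTuple):
    score: float   # similarity between the query and the stored item
    text: str      # artifact answer or raw chunk text

def answer(query: str,
           artifact_search: Callable[[str], Optional[Hit]],
           chunk_search: Callable[[str], list[Hit]],
           synthesize: Callable[[str, list[str]], str],
           threshold: float = 0.75) -> tuple[str, str]:
    """Serve the best artifact if it is close enough, else fall back to RAG."""
    best = artifact_search(query)
    if best is not None and best.score >= threshold:
        return best.text, "artifact"   # no LLM call at query time
    chunks = chunk_search(query)
    return synthesize(query, [c.text for c in chunks]), "fallback"

# Stubs standing in for a real index and a real LLM.
def artifact_search(q: str) -> Optional[Hit]:
    if "hybrid" in q.lower():
        return Hit(0.91, "Hybrid retrieval merges BM25 and vector rankings.")
    return Hit(0.40, "unrelated artifact")

def chunk_search(q: str) -> list[Hit]:
    return [Hit(0.60, "chunk one"), Hit(0.50, "chunk two")]

def synthesize(q: str, chunks: list[str]) -> str:
    return f"[synthesized from {len(chunks)} chunks]"

print(answer("How does hybrid retrieval work?", artifact_search, chunk_search, synthesize))
print(answer("Why is the sky blue?", artifact_search, chunk_search, synthesize))
```

Returning the route alongside the answer makes it easy to log how often the fallback fires, which is a cheap first signal of coverage gaps.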

11. Open Questions

This eval raises more questions than it answers. The three most important:

Artifact density. Is 3 Q&A pairs per document the right density? Too few → coverage gaps. Too many → noise and retrieval dilution. We didn't test what ratio actually maximizes coverage without degrading signal.

Fallback strategy. What's the right trigger for switching from pre-compiled artifacts to naive chunk retrieval? Low vector similarity is a proxy, but it doesn't tell you whether the question is answerable by either path. Getting this wrong means either losing the artifact benefit or returning worse answers.

Build cost crossover. At what query volume does a one-time build cost become worth it vs paying synthesis cost per query? This depends on LLM pricing, corpus update frequency, and how often the answer changes. We didn't measure this — it's corpus- and pricing-dependent.
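
The shape of the crossover calculation is simple even though the inputs are not known here. The sketch below works in tokens and assumes build and query tokens are priced the same, which is a simplification. The build token count in the example is hypothetical, since we did not record it; only the 800 tokens per query comes from the estimate in section 6.

```python
import math

def break_even_queries(build_tokens: int, synthesis_tokens_per_query: int) -> int:
    """Queries needed before a one-time build costs less than per-query synthesis."""
    if synthesis_tokens_per_query <= 0:
        raise ValueError("synthesis_tokens_per_query must be positive")
    return math.ceil(build_tokens / synthesis_tokens_per_query)

def break_even_queries_per_day(build_tokens: int,
                               synthesis_tokens_per_query: int,
                               rebuild_every_days: int) -> int:
    """Daily volume needed when the artifacts are rebuilt on a fixed schedule."""
    if rebuild_every_days <= 0:
        raise ValueError("rebuild_every_days must be positive")
    total = break_even_queries(build_tokens, synthesis_tokens_per_query)
    return math.ceil(total / rebuild_every_days)

# Hypothetical build cost of 2M tokens; 800 tokens/query is the earlier estimate.
print(break_even_queries(2_000_000, 800))             # 2500
print(break_even_queries_per_day(2_000_000, 800, 7))  # 358
```

A corpus that changes often shortens the rebuild interval, which pushes the required daily volume up.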

The honest answer: we don't know the optimal artifact design. This was a first data point, not a definitive answer.


Code and Data

The evaluation was implemented in Python with the following stack:

Build log and full eval results (JSON) are available on request.

This eval was designed to understand the tradeoffs, not to sell a product. The conclusion is that pre-compiled artifacts are a powerful pattern with real limitations — and that understanding both is necessary before building a system around either approach.