Research Notes
Honest writeups on AI systems

We Gave a Small Model a Search Engine. The Results Were Not What We Expected.

MiniMax-M2.7 as a multi-hop retrieval agent on HotpotQA — the good, the bad, and the surprising gaps between two vector databases.

I wanted to know one thing: does giving a small reasoning model a search engine actually help on hard multi-hop questions? Not just "can it retrieve" but "can it plan, search, and bridge facts the way a human researcher would?"

The setup was simple. Take HotpotQA — a dataset of multi-hop questions where you can't answer in one search. Give a model two tools: vector search and keyword search. See what happens.

The Setup

Model: MiniMax-M2.7, a small reasoning model (not the biggest, not the most expensive). The hypothesis was that smaller models are cheaper to run and might be sufficient for tool-calling decisions — you don't need a frontier model to decide what to search for, only to understand the answer.

Dataset: HotpotQA (distractor split), 100 questions ranging from 2-hop to 5-hop. 2-hop questions require connecting two facts. 3-hop requires three. One question in our set required five separate facts to be stitched together.

Two vector databases: LanceDB and Qdrant. Same embedding model (all-MiniLM-L6-v2, 384-dim embeddings). Same corpus. Same retrieval algorithm. We wanted to know whether the database mattered for agent-level performance, not just raw recall.

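Two-stage evaluation: first, baseline retrieval (vector, BM25, and RRF fusion) on all 100 questions; second, the agent on only the questions where baseline RRF R@5 fell short of 1.0.

59 of the 100 questions were solved by baseline RRF alone: the top-5 already contained the needed documents. Those didn't need an agent. The remaining 41 were genuinely hard.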
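To make the easy/hard split concrete, here is a minimal sketch of recall@5 and the split it implies. The function names, the dictionary shape, and the document ids are made up for illustration; they are not taken from the eval code.

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold documents that appear in the top-k retrieved ids."""
    top_k = set(retrieved[:k])
    return sum(1 for doc in gold if doc in top_k) / len(gold)


def split_easy_hard(results, k=5):
    """Split questions by whether baseline recall@k is already 1.0.

    `results` maps a question id to (retrieved_ids, gold_ids).
    This shape is an assumption made for the example.
    """
    easy, hard = [], []
    for qid, (retrieved, gold) in results.items():
        if recall_at_k(retrieved, gold, k) == 1.0:
            easy.append(qid)
        else:
            hard.append(qid)
    return easy, hard


# Toy data with invented ids: q1 has both gold docs in the top-5, q2 has one.
results = {
    "q1": (["d5", "d7", "d1", "d2", "d3"], ["d5", "d7"]),
    "q2": (["d37", "d39", "d34", "d2", "d3"], ["d37", "d905"]),
}
easy, hard = split_easy_hard(results)
print("easy:", easy, "hard:", hard)
```

A question counts as easy only when every gold document is in the top-5, which matches the "RRF R@5 < 1.0" cutoff for hard questions in the methodology note.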

What Baseline Retrieval Looked Like

Method             R@5 (all 100)
Vector (LanceDB)   79.5%
BM25 (keyword)     63.0%
RRF Fusion         79.0%

79% sounds decent until you realize that's across all 100 questions. On the 41 hard questions — the ones where a single search doesn't cut it — RRF dropped to 48.8%. For a 3-hop question where you need Doc A to find Doc B to find Doc C, you can't brute-force your way there with a single retrieval.
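For readers who haven't met RRF (reciprocal rank fusion): it merges ranked lists by summing 1/(k + rank) for each document across the lists. The sketch below is a generic version, not the eval's code. The constant k=60 is the commonly used default and is an assumption here, since the value used in these runs isn't stated, and the document ids are invented.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Reciprocal rank fusion over several ranked lists of doc ids.

    k=60 is a conventional default, assumed for this example.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Invented rankings: one from vector search, one from BM25.
vector_hits = ["A", "B", "C"]
bm25_hits = ["C", "A", "D"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank highly float to the top, but fusion can only reorder what the two searches already returned. It cannot reach a document that depends on a fact found in another document, which is the gap the agent is meant to close.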

The Agent

The model was prompted to think out loud, choose between vector and keyword search strategically, and not repeat queries. Max 4 turns. Temperature 0.3.

"You are a precise research assistant searching a document corpus. For multi-hop: find first fact, then use it to refine the next search. State your confidence (HIGH/MEDIUM/LOW) when answering."

Here's what a 2-hop question looked like for the agent:

Q: Which magazine was started first: Arthur's Magazine or First for Women?
GT: ["Arthur's Magazine", "First for Women"] · Answer: Arthur's Magazine
Turn 1 [bm25_search]: q="Arthur's Magazine launch date" → Doc 5, Doc 40, Doc 47
Turn 2 [bm25_search]: q="First for Women magazine founded year" → Doc 7, Doc 32, Doc 33
Turn 3: FINAL — Arthur's Magazine (1844) predates First for Women (1989). Answer: Arthur's Magazine. Recall: 2/2.

Clean. Two targeted searches, two answers, done. But this was an easy one — baseline RRF also gets this one. The interesting cases are where baseline fails but the agent persists.
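The loop behind a trace like this can be sketched in a few lines. This is a simplified stand-in, not the actual harness: `decide` is a hypothetical callable wrapping the model, the tools are stubs, and the rule for repeated queries is my reading of "not repeat queries" rather than something the prompt enforces in code.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_TURNS = 4  # turn cap used in this eval

# One history entry: (tool name, query, returned doc ids)
Step = Tuple[str, str, List[str]]


def run_agent(
    question: str,
    decide: Callable[[str, List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], List[str]]],
    max_turns: int = MAX_TURNS,
) -> Tuple[Optional[str], List[Step]]:
    """Run up to max_turns; each model decision, including the final answer, is a turn."""
    history: List[Step] = []
    asked = set()
    for _ in range(max_turns):
        kind, text = decide(question, history)  # hypothetical model call
        if kind == "final":
            return text, history
        if (kind, text) in asked:
            # Repeated query: record an empty result instead of searching again.
            history.append((kind, text, []))
            continue
        asked.add((kind, text))
        history.append((kind, text, tools[kind](text)))
    return None, history  # ran out of turns without a final answer


def scripted_decide(question: str, history: List[Step]) -> Tuple[str, str]:
    """Replays the magazine trace above in place of a real model."""
    script = [
        ("bm25_search", "Arthur's Magazine launch date"),
        ("bm25_search", "First for Women magazine founded year"),
        ("final", "Arthur's Magazine"),
    ]
    return script[len(history)]


# Stub tools that return a placeholder id; real ones would query the index.
tools = {
    "bm25_search": lambda q: ["doc-stub"],
    "vector_search": lambda q: ["doc-stub"],
}
answer, history = run_agent(
    "Which magazine was started first?", scripted_decide, tools
)
print(answer, len(history))
```

Counting the final answer as a turn matches the traces here, where "Turn 3: FINAL" follows two searches. It also means a question that spends all four turns searching ends without an answer.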

The 3-Hop Case

Q: What nationality was James Henry Miller's wife?
GT: ["Peggy Seeger", "Ewan MacColl"] · Answer: American
Turn 1 [bm25_search]: q="James Henry Miller wife nationality" → Doc 37, Doc 39, Doc 34
Turn 2 [bm25_search]: q="Ewan MacColl wife married" → Doc 37, Doc 848, Doc 32
Turn 3 [bm25_search]: q="Peggy Seeger nationality" → Doc 37, Doc 905, Doc 454
Turn 4: FINAL — Peggy Seeger is American. Recall: 2/2.

Three separate searches to bridge: James Henry Miller → Ewan MacColl → Peggy Seeger → American. The agent correctly identified the chain. Baseline on this question would have failed — you can't answer "What nationality was X's wife?" in a single search.

The 4-Hop Case

Q: Which genus of moth in the world's seventh-largest country contains only one species?
GT: ["Indogrammodes", "India"] · Answer: Crambidae
Turn 1 [vector_search]: q="Argentina moth genus monotypic only one species" → Doc 65, Doc 69, Doc 68
Turn 2 [bm25_search]: q="Argentina moth genus endemic only one species" → Doc 65, Doc 66, Doc 64
Turn 3 [bm25_search]: q="seventh largest country world area ranking" → Doc 990, Doc 568, Doc 55
Turn 4 [vector_search]: q="India monotypic moth genus only one species" → Doc 67, Doc 65, Doc 69

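The agent pivots three times: Argentina → seventh-largest country → India → moth genus. That's a 4-turn reasoning chain. The question doesn't mention India directly: you have to work out that "seventh-largest country" means India, then search for moths there. The agent gets there the hard way. It starts from a wrong guess (Argentina), uses turn 3 to look up the area ranking, and re-runs the moth search against India on turn 4, which leaves no turn for a final answer inside the 4-turn cap.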

The Results on Hard Questions

41 questions had RRF recall < 1.0 at baseline. These are the ones that actually needed the agent.

+45pp: agent gain over baseline RRF in R@5 on 3-hop questions (LanceDB)

Hop count       N hard   Baseline RRF R@5   LanceDB Agent R@5   Qdrant Agent R@5   Agent avg turns
2-hop           26       50.0%              94.2%               86.5%              3.9
3-hop           10       50.0%              95.0%               80.0%              3.9
4-hop           4        37.5%              87.5%               87.5%              4.0
5-hop           1        50.0%              100%                100%               4.0
All hard (41)   41       48.8%              93.9%               85.4%              3.9 / 3.7
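As a sanity check, the "All hard" row should be the question-weighted average of the per-hop rows. The snippet below recomputes it from the table values only. Because the per-hop percentages are themselves rounded, the result can differ from the table by a few hundredths of a point.

```python
# (hop label, N hard, baseline RRF, LanceDB agent, Qdrant agent), R@5 in percent,
# copied from the table above.
rows = [
    ("2-hop", 26, 50.0, 94.2, 86.5),
    ("3-hop", 10, 50.0, 95.0, 80.0),
    ("4-hop", 4, 37.5, 87.5, 87.5),
    ("5-hop", 1, 50.0, 100.0, 100.0),
]


def weighted_mean(rows, col):
    """Average of column `col`, weighted by the number of hard questions."""
    total = sum(r[1] for r in rows)
    return sum(r[1] * r[col] for r in rows) / total


n_hard = sum(r[1] for r in rows)
baseline = weighted_mean(rows, 2)
lancedb = weighted_mean(rows, 3)
qdrant = weighted_mean(rows, 4)
print(n_hard, round(baseline, 2), round(lancedb, 2), round(qdrant, 2))
```

All three recomputed averages land within 0.1 points of the reported 48.8%, 93.9%, and 85.4%.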

Where LanceDB and Qdrant Diverge

This was the surprising part. Same embeddings, same algorithm, same corpus, but the agent got different results on the two databases. On 6 of the 41 hard questions, the agent succeeded on LanceDB but failed on Qdrant. Not because of a slightly different ranking, but because a needed document never surfaced at all.

Example:

Q: The Thoen Stone is on display at a museum in what county?
GT: ["The Thoen Stone", "Loudoun County"]
LanceDB (4 turns): recall=1.0 — agent finds both documents
Qdrant (4 turns): recall=0.5 — agent finds only 1 of 2

At 4 turns, the agent has seen at most 20 retrieved documents (5 per turn × 4 turns). If Qdrant ranks the right document at position 7 instead of position 3 on the critical search, the agent may never surface it. These are not retrieval failures in the usual sense: both databases return the correct document somewhere in the top-10. They are attention failures. The agent can't build on a document that never shows up in its top-5 window, and if that document is the bridge to the next hop, the later searches never get pointed at the right place.
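The rank-3 versus rank-7 point is easy to see in a toy example. Below, agent-level recall is the share of gold documents that appear in the top-5 of any turn. All ids are invented, and the sketch ignores that a real agent's later queries depend on what it saw earlier, which in practice makes a missed document more costly than shown here.

```python
def docs_seen(turn_rankings, k=5):
    """Union of the top-k doc ids the agent is shown across all turns."""
    seen = set()
    for ranking in turn_rankings:
        seen.update(ranking[:k])
    return seen


def agent_recall(turn_rankings, gold, k=5):
    """Share of gold documents that appeared in some turn's top-k."""
    return len(docs_seen(turn_rankings, k) & set(gold)) / len(gold)


gold = {"bridge", "answer"}
filler = [f"x{i}" for i in range(10)]  # invented distractor ids

first_turn = ["bridge"] + filler[:9]
# Same ten documents on the critical search, with "answer" at rank 3 vs rank 7.
answer_at_3 = filler[:2] + ["answer"] + filler[2:9]
answer_at_7 = filler[:6] + ["answer"] + filler[6:9]

recall_rank3 = agent_recall([first_turn, answer_at_3], gold)
recall_rank7 = agent_recall([first_turn, answer_at_7], gold)
print(recall_rank3, recall_rank7)
```

Both rankings have identical recall@10, so a top-10 retrieval metric would call them equal. Only the top-5 window the agent actually reads tells them apart.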

What This Tells Us

1. Small models can do real multi-hop reasoning with tools. MiniMax-M2.7 navigated 2-, 3-, 4-, and even 5-hop chains. In the traces above it picked a sensible tool for each sub-query, avoided exact repeat searches, and built on previous findings. This wasn't a prompt-engineering trick: the model was making genuine sequential decisions.

2. The agent earns its turns on hard questions. Baseline RRF on the 41 hard questions was 48.8%. The agent got 93.9% (LanceDB). That's a 45 percentage point gap — meaningful, and real.

3. Database ordering matters more at agent-level than at retrieval-level. Both databases had identical RRF recall (79%) on the full set. But at agent-level, LanceDB pulled ahead by 8.5pp. The agent is sensitive to subtle ranking differences that raw recall metrics miss.

4. The easy/hard split is real and predictable. RRF solved 59/100 questions in one shot. Only 41 needed an agent. A smart pipeline that routes based on baseline performance could save ~60% of agent API calls on datasets where this split holds.
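The routing idea in point 4 could look roughly like the sketch below. One caveat matters: in this eval the easy/hard split was computed from gold labels (RRF R@5 = 1.0), which a production system would not have. The `baseline_sufficient` predicate is therefore a hypothetical stand-in for some label-free proxy, and the two runner functions are stubs.

```python
def route(questions, baseline_sufficient, run_baseline, run_agent):
    """Answer with baseline retrieval when it looks sufficient, else call the agent.

    All three callables are hypothetical placeholders for this example.
    """
    answers, agent_calls = {}, 0
    for q in questions:
        if baseline_sufficient(q):
            answers[q] = run_baseline(q)
        else:
            answers[q] = run_agent(q)
            agent_calls += 1
    return answers, agent_calls


# Mirror the 59/41 split from this eval with placeholder question ids.
questions = [f"q{i}" for i in range(100)]
easy = set(questions[:59])

answers, agent_calls = route(
    questions,
    baseline_sufficient=lambda q: q in easy,  # oracle here; a real router needs a proxy
    run_baseline=lambda q: "baseline",
    run_agent=lambda q: "agent",
)
saved = 1 - agent_calls / len(questions)
print(agent_calls, saved)
```

With a perfect router, 41 of 100 questions reach the agent and 59% of agent runs are skipped. A real proxy would misroute some questions in both directions, so the achievable saving is probably lower.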

The Catch

I want to be honest about what didn't work. The model takes 3.9 turns on average, not 1. That's roughly 4 API calls per question, each with a 2,048-token budget. At $0.015/1M tokens for MiniMax-M2.7, those figures put inference at roughly $0.00012 per question (4 × 2,048 tokens × $0.015 per million). That is tiny per question, but multiply it across millions of questions and extra turns and it adds up. And this model, while small by frontier standards, still hallucinated documents on a few questions. The tool-calling was correct; the final answer sometimes wasn't.
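Here is the back-of-the-envelope version, using only the figures quoted above (4 calls, 2,048 tokens per call, $0.015 per 1M tokens). It charges a flat 2,048 tokens per call and nothing else, so treat it as a rough estimate. Real billing depends on actual prompt and completion lengths, which grow as the conversation accumulates search results.

```python
def cost_per_question(turns, tokens_per_call, usd_per_million_tokens):
    """Flat estimate: every call is billed for exactly tokens_per_call tokens."""
    return turns * tokens_per_call * usd_per_million_tokens / 1_000_000


per_question = cost_per_question(
    turns=4, tokens_per_call=2048, usd_per_million_tokens=0.015
)
per_million_questions = per_question * 1_000_000
print(f"${per_question:.6f} per question, ${per_million_questions:.2f} per 1M questions")
```

Under these assumptions the cost is linear in the number of turns, so moving from 4 turns to 6 or 8 raises it proportionally even before longer contexts are counted.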

The original hypothesis was that smaller models would be sufficient for tool-calling while larger models handle reasoning. MiniMax-M2.7 is small, and it did handle the reasoning — but "small enough" is still not "free." The cost-performance tradeoff on real workloads needs more study.

What Comes Next

The eval pipeline is running at build-indexzero.pages.dev. The code is open. The traces are saved. If you want to reproduce this, swap the embedding model, change the prompting strategy, or try a different fusion algorithm — the setup is there.

More interesting: what happens when you combine the smart-routing idea (baseline first, agent only for hard questions) with a model that can do 8+ turns without drifting? The 4-turn limit in this eval was a self-imposed constraint — the model regularly used all 4 turns, which suggests it wanted more. For the hardest 4-hop questions, does giving it 6 turns close the remaining gap?

That's the next experiment.

Methodology note: All evaluations used HotpotQA (distractor split), 100 questions, 992 passages. Embedding model: all-MiniLM-L6-v2 (384-dim). Vector databases: LanceDB (local) and Qdrant (local, collection "hotpot_100_smart"). Agent model: MiniMax-M2.7 with temperature 0.3, max_tokens 2,048, max 4 turns. Baseline was run on all 100 questions; agent evaluation was run only on the 41 hard questions where RRF R@5 < 1.0. Full traces and raw results available on request.