MiniMax-M2.7 as a multi-hop retrieval agent on HotpotQA — the good, the bad, and the surprising gaps between two vector databases.
I wanted to know one thing: does giving a small reasoning model a search engine actually help on hard multi-hop questions? Not just "can it retrieve" but "can it plan, search, and bridge facts the way a human researcher would?"
The setup was simple. Take HotpotQA — a dataset of multi-hop questions where you can't answer in one search. Give a model two tools: vector search and keyword search. See what happens.
Model: MiniMax-M2.7, a small reasoning model (not the biggest, not the most expensive). The hypothesis was that smaller models are cheaper to run and might be sufficient for tool-calling decisions — you don't need a frontier model to decide what to search for, only to understand the answer.
Dataset: HotpotQA (distractor split), 100 questions ranging from 2-hop to 5-hop. 2-hop questions require connecting two facts. 3-hop requires three. One question in our set required five separate facts to be stitched together.
Two vector databases: LanceDB and Qdrant. Same embedding model (all-MiniLM-L6-v2, 384-dim embeddings). Same corpus. Same retrieval algorithm. We wanted to know if the database mattered for agent-level performance, not just raw recall.
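Two-stage evaluation: first, single-shot baseline retrieval (vector, BM25, and RRF fusion) on all 100 questions; second, the agent on the questions the baseline did not fully solve.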
59 of the 100 questions were solved by baseline RRF alone — the top-5 already contained the needed documents. Those didn't need an agent. The remaining 41 were genuinely hard.
| Method | R@5 (all 100) |
|---|---|
| Vector (LanceDB) | 79.5% |
| BM25 (keyword) | 63.0% |
| RRF Fusion | 79.0% |
79% sounds decent until you realize that's across all 100 questions. On the 41 hard questions — the ones where a single search doesn't cut it — RRF dropped to 48.8%. For a 3-hop question where you need Doc A to find Doc B to find Doc C, you can't brute-force your way there with a single retrieval.
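The post does not show its fusion code, so the following is a minimal sketch, assuming "RRF" here means standard Reciprocal Rank Fusion with the commonly used constant k = 60. The function name `rrf_fuse` and the document ids are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not taken from the post.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id so output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical top-5 lists from the two retrievers for a single query.
vector_hits = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
bm25_hits = ["doc_c", "doc_a", "doc_f", "doc_g", "doc_h"]

fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, but fusion still works from a single query. It cannot reach a document that only becomes findable after an earlier hop has been read.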
The model was prompted to think out loud, choose between vector and keyword search strategically, and not repeat queries. Max 4 turns. Temperature 0.3.
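A bounded search loop with those constraints might look roughly like the sketch below. It is not the author's implementation: `decide` is a hypothetical stand-in for the model call, the toy corpus is invented for illustration, and counting the final answer as one of the 4 turns is an assumption.

```python
MAX_TURNS = 4  # turn limit stated in the post


def run_agent(question, decide, tools, max_turns=MAX_TURNS):
    """Run a bounded tool-calling loop.

    decide(question, history) is a hypothetical stand-in for the model.
    It returns ("search", tool_name, query) or ("answer", text, confidence).
    """
    history = []  # (tool_name, query, results) for every turn that searched
    seen = set()  # (tool_name, query) pairs already issued
    for _ in range(max_turns):
        action = decide(question, history)
        if action[0] == "answer":
            return {"answer": action[1], "confidence": action[2], "history": history}
        _, tool_name, query = action
        if (tool_name, query) in seen:
            # Repeated query: spend the turn but do not hit the index again.
            history.append((tool_name, query, []))
            continue
        seen.add((tool_name, query))
        history.append((tool_name, query, tools[tool_name](query)))
    return {"answer": None, "confidence": "LOW", "history": history}


def toy_search(query):
    # Hypothetical three-document corpus keyed by exact query string.
    corpus = {
        "James Henry Miller": ["James Henry Miller, known as Ewan MacColl"],
        "Ewan MacColl wife": ["Ewan MacColl married Peggy Seeger"],
        "Peggy Seeger": ["Peggy Seeger is an American folk singer"],
    }
    return corpus.get(query, [])


def scripted_decide(question, history):
    # A fixed script standing in for the model's turn-by-turn decisions.
    script = [
        ("search", "vector", "James Henry Miller"),
        ("search", "keyword", "Ewan MacColl wife"),
        ("search", "keyword", "Peggy Seeger"),
        ("answer", "American", "HIGH"),
    ]
    return script[len(history)]


result = run_agent(
    "What nationality was James Henry Miller's wife?",
    scripted_decide,
    {"vector": toy_search, "keyword": toy_search},
)
print(result["answer"], result["confidence"], len(result["history"]))
```

With a 4-turn budget, three searches plus one answer is the longest chain this loop can complete, which is one reason the turn limit matters for 4-hop and 5-hop questions.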
"You are a precise research assistant searching a document corpus. For multi-hop: find first fact, then use it to refine the next search. State your confidence (HIGH/MEDIUM/LOW) when answering."
Here's what a 2-hop question looked like for the agent:
Clean. Two targeted searches, two answers, done. But this was an easy one — baseline RRF also gets this one. The interesting cases are where baseline fails but the agent persists.
Three separate searches to bridge: James Henry Miller → Ewan MacColl → Peggy Seeger → American. The agent correctly identified the chain. Baseline on this question would have failed — you can't answer "What nationality was X's wife?" in a single search.
The agent pivots three times: Argentina → seventh-largest country → India → moth genus. That's a 4-turn reasoning chain. The question doesn't mention India directly — you have to infer that "seventh-largest country" refers to India, then search for moths there. The agent does exactly that.
41 questions had RRF recall < 1.0 at baseline. These are the ones that actually needed the agent.
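For concreteness, here is a minimal sketch of the recall@5 metric and the easy/hard split as described: a question counts as hard when the baseline top-5 misses at least one supporting document. The function names and the toy data are hypothetical.

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of gold (supporting) document ids found in the top-k retrieved ids."""
    if not gold:
        return 0.0  # convention chosen for this sketch
    top = set(retrieved[:k])
    return sum(1 for g in gold if g in top) / len(gold)


def split_easy_hard(runs, k=5):
    """Partition question ids: 'hard' means baseline recall@k is below 1.0."""
    easy, hard = [], []
    for qid, (retrieved, gold) in runs.items():
        if recall_at_k(retrieved, gold, k) == 1.0:
            easy.append(qid)
        else:
            hard.append(qid)
    return easy, hard


# Toy baseline output: question id -> (retrieved ids, gold ids).
runs = {
    "q1": (["d1", "d2", "d3", "d4", "d5"], ["d1", "d3"]),  # both gold docs in top-5
    "q2": (["d9", "d2", "d7", "d4", "d5"], ["d2", "d8"]),  # only one of two gold docs
}
easy, hard = split_easy_hard(runs)
print(easy, hard, recall_at_k(*runs["q2"]))
```

A 2-hop question where the first document is found but the bridge document is not scores 0.5, which is the kind of partial result that lands a question in the hard set.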
Agent over baseline RRF on 3-hop questions (LanceDB): 95.0% vs. 50.0% R@5.
| Hop count | N hard | Baseline RRF R@5 | LanceDB Agent R@5 | Qdrant Agent R@5 | Agent avg turns |
|---|---|---|---|---|---|
| 2-hop | 26 | 50.0% | 94.2% | 86.5% | 3.9 |
| 3-hop | 10 | 50.0% | 95.0% | 80.0% | 3.9 |
| 4-hop | 4 | 37.5% | 87.5% | 87.5% | 4.0 |
| 5-hop | 1 | 50.0% | 100% | 100% | 4.0 |
| All hard (41) | 41 | 48.8% | 93.9% | 85.4% | 3.9 / 3.7 |
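As a sanity check, the "All hard" row can be recomputed as an N-weighted average of the per-hop rows. This sketch uses only the numbers in the table; because the per-hop values are already rounded, the Qdrant figure comes out a few hundredths of a point below the reported 85.4%.

```python
rows = [
    # (hop count, N hard, baseline R@5, LanceDB agent R@5, Qdrant agent R@5)
    (2, 26, 50.0, 94.2, 86.5),
    (3, 10, 50.0, 95.0, 80.0),
    (4, 4, 37.5, 87.5, 87.5),
    (5, 1, 50.0, 100.0, 100.0),
]

total = sum(row[1] for row in rows)


def pooled(col):
    """Average one column across hop counts, weighted by the number of questions."""
    return sum(row[1] * row[col] for row in rows) / total


baseline, lancedb, qdrant = pooled(2), pooled(3), pooled(4)
print(total, round(baseline, 1), round(lancedb, 1), round(qdrant, 1))
print("LanceDB gain over baseline (pp):", round(lancedb - baseline, 1))
```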
This was the surprising part. Same embeddings, same algorithm, same corpus, but the agent got different results on the two databases. On 6 of the 41 hard questions, the agent found the answer on LanceDB but failed on Qdrant. These were not near misses from slightly different rankings; they were complete misses.
Example:
At 4 turns, the agent has seen at most 20 retrieved documents (5 per turn × 4 turns). If Qdrant ranks the right document at position 7 instead of position 3 on the critical search, the agent may never surface it. These are not retrieval failures: both databases return the correct document somewhere in the top 10. They are attention failures: a document that falls outside the top 5 on the critical turn never reaches the agent's context, so the agent has nothing to build on in later turns.
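The rank-3 versus rank-7 case can be made concrete with a small sketch. The two rankings are invented for illustration, and "bridge" is a hypothetical id for the document the next hop depends on.

```python
def visible_rank(ranking, doc_id, per_turn_k=5):
    """Return the 1-based rank of doc_id if it falls inside the top-k window, else None."""
    window = ranking[:per_turn_k]
    return window.index(doc_id) + 1 if doc_id in window else None


# Hypothetical results for the same query from two indexes.
ranking_a = ["d1", "d2", "bridge", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]  # rank 3
ranking_b = ["d1", "d2", "d4", "d5", "d6", "d7", "bridge", "d8", "d9", "d10"]  # rank 7

# Both rankings contain the document in the top 10, so recall@10 cannot tell them apart.
both_in_top10 = "bridge" in ranking_a[:10] and "bridge" in ranking_b[:10]

print(both_in_top10)
print(visible_rank(ranking_a, "bridge"))  # inside the 5-document window
print(visible_rank(ranking_b, "bridge"))  # outside the window, never shown to the agent
```

A metric computed at a deeper cutoff than the agent's per-turn window will hide exactly this kind of difference.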
1. Small models can do real multi-hop reasoning with tools. MiniMax-M2.7 navigated 2-, 3-, 4-, and even 5-hop chains. It chose the right tool for each sub-query. It didn't repeat searches. It built on previous findings. This wasn't a prompt-engineering trick: the model was making genuine sequential decisions.
2. The agent earns its turns on hard questions. Baseline RRF on the 41 hard questions was 48.8%. The agent got 93.9% (LanceDB). That's a 45 percentage point gap — meaningful, and real.
3. Database ordering matters more at agent-level than at retrieval-level. Both databases had identical RRF recall (79%) on the full set. But at agent-level, LanceDB pulled ahead by 8.5pp. The agent is sensitive to subtle ranking differences that raw recall metrics miss.
4. The easy/hard split is real and predictable. RRF solved 59/100 questions in one shot. Only 41 needed an agent. A smart pipeline that routes based on baseline performance could save ~60% of agent API calls on datasets where this split holds.
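The routing idea in point 4 could be sketched as follows. One caveat: in this eval the easy/hard label comes from gold documents, which are not available at inference time, so `baseline_is_sufficient` is a hypothetical predicate that would need some proxy signal in practice. All names here are illustrative.

```python
def answer_with_routing(questions, baseline_is_sufficient, run_baseline, run_agent):
    """Answer each question with the cheap baseline unless it is flagged as hard."""
    results = {}
    agent_runs = 0
    for q in questions:
        if baseline_is_sufficient(q):
            results[q] = run_baseline(q)
        else:
            results[q] = run_agent(q)
            agent_runs += 1
    return results, agent_runs


# Toy stand-ins: 100 question ids, the first 59 flagged as solvable by the baseline.
questions = [f"q{i}" for i in range(100)]
easy_ids = set(questions[:59])

results, agent_runs = answer_with_routing(
    questions,
    baseline_is_sufficient=lambda q: q in easy_ids,
    run_baseline=lambda q: "baseline",
    run_agent=lambda q: "agent",
)
saved_fraction = 1 - agent_runs / len(questions)
print(agent_runs, round(saved_fraction, 2))
```

With the 59/41 split from this eval, the agent runs on 41 questions instead of 100. How much of that saving survives depends on how well the proxy separates easy from hard questions.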
I want to be honest about what didn't work. The model takes 3.9 turns on average, not 1. That's roughly 4 API calls per question, each with a 2,048-token context. At $0.015/1M tokens for MiniMax-M2.7, the cost per question is ~$0.00025 in inference, but multiply that across millions of questions and it adds up fast. And this model, while small by frontier standards, still hallucinated documents on a few questions. The tool-calling was correct; the final answer sometimes wasn't.
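A back-of-the-envelope version of that cost estimate, using only the figures quoted above, is sketched below. Taken at face value (3.9 calls of 2,048 tokens each at the stated price), the context tokens alone come to about $0.00012 per question. The ~$0.00025 quoted above is roughly double that, which may reflect output tokens or context that grows across turns; the post does not break it down, so treat both numbers as rough.

```python
AVG_TURNS = 3.9                    # average agent turns, from the post
TOKENS_PER_CALL = 2048             # context size per call, from the post
PRICE_PER_MILLION_TOKENS = 0.015   # USD, price as stated in the post


def cost_per_question(turns=AVG_TURNS, tokens=TOKENS_PER_CALL, price=PRICE_PER_MILLION_TOKENS):
    """Estimated cost in USD, counting only the stated per-call context tokens."""
    return turns * tokens * price / 1_000_000


per_question = cost_per_question()
per_million_questions = per_question * 1_000_000

print(f"per question: ${per_question:.6f}")
print(f"per million questions: ${per_million_questions:,.2f}")
```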
The original hypothesis was that smaller models would be sufficient for tool-calling while larger models handle reasoning. MiniMax-M2.7 is small, and it did handle the reasoning — but "small enough" is still not "free." The cost-performance tradeoff on real workloads needs more study.
The eval pipeline is running at build-indexzero.pages.dev. The code is open. The traces are saved. If you want to reproduce this, swap the embedding model, change the prompting strategy, or try a different fusion algorithm — the setup is there.
More interesting: what happens when you combine the smart-routing idea (baseline first, agent only for hard questions) with a model that can do 8+ turns without drifting? The 4-turn limit in this eval was a self-imposed constraint — the model regularly used all 4 turns, which suggests it wanted more. For the hardest 4-hop questions, does giving it 6 turns close the remaining gap?
That's the next experiment.