The Challenge
ColPali-based multimodal retrievers have achieved impressive results in visual document retrieval by directly embedding page images and scoring textual query tokens against image patches. However, recent attempts to push this paradigm further have relied on massive scaling of query and document representations. The leading model on the ViDoRe leaderboard, llama-nemoretriever-colembed-3b, requires over 10 MB of memory per document page, roughly three orders of magnitude more than single-vector dense retrievers. Such overhead creates real obstacles for deployment in production systems. Additionally, purely vision-centric approaches may be constrained by the inherent modality gap exhibited by modern vision-language models.
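As a concrete illustration of this late-interaction scoring, here is a minimal MaxSim scorer of the kind ColPali-style retrievers use; the shapes and the L2-normalization assumption below are illustrative rather than tied to any specific model.

```python
# A minimal late-interaction (MaxSim) scorer in the ColPali style.
# Assumptions for illustration: q is (num_query_tokens, dim), p is (num_image_patches, dim),
# and both are L2-normalized so the dot product is a cosine similarity.
import torch

def maxsim_score(q: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    sims = q @ p.T                       # token-to-patch similarities
    return sims.max(dim=1).values.sum()  # best patch per query token, summed over tokens
```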
Memory Requirements Comparison
| Model | Vectors per Page | Token Dim | # Floats per Page | Memory per Page | Storage per 1M Docs (GB) | Modality |
|---|---|---|---|---|---|---|
| LINQ-EMBED-MISTRAL | 1 | 4096 | 4096 | 8 KB | 7.63 | Text |
| QWEN3-EMBEDDING-4B | 1 | 2560 | 2560 | 5 KB | 4.77 | Text |
| JINA-EMBEDDINGS-V4 (Image) | 767 | 128 | 98176 | 196 KB | 182.87 | Image |
| COLNOMIC-EMBED-MULTIMODAL-7B | 767 | 128 | 98176 | 196 KB | 182.87 | Image |
| LLAMA-NEMORETRIEVER-COLEMBED-3B | 1802 | 3072 | 5535744 | 10.6 MB | 10,311.13 | Image |
Note: Storage assumes a 16-bit representation. The first two rows are lightweight text models, the next two are medium-cost vision models, and the last row is the high-cost vision model.
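The storage figures above follow directly from the vector counts and dimensions; a small sketch of the arithmetic for the largest model (assuming 16-bit floats, as in the note):

```python
# How the table's storage figures are derived, assuming 16-bit (2-byte) floats.
def page_bytes(vectors_per_page: int, dim: int, bytes_per_float: int = 2) -> int:
    return vectors_per_page * dim * bytes_per_float

# llama-nemoretriever-colembed-3b: 1802 vectors x 3072 dims per page
per_page = page_bytes(1802, 3072)              # 11,071,488 bytes ~ 10.6 MB per page
per_million_gb = per_page * 1_000_000 / 1024**3  # ~ 10,311 GB for 1M pages
```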
The Research Question
In practice, text and vision retrievers often provide complementary signals, succeeding and failing on different subsets of queries. Motivated by this observation, we ask:
Can we leverage the complementary strengths of lightweight text retrievers to enhance these powerful but resource-intensive models?
Yet traditional hybrid retrieval methods combine retrievers only at the level of ranks or scores. This coarse aggregation fails to exploit the richer information contained within each model’s representation space. Can we instead leverage representation-level signals for more effective aggregation?
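As a point of reference for this kind of coarse aggregation, here is a minimal sketch of Reciprocal Rank Fusion (RRF), a standard rank-level fusion baseline; k=60 is the conventional RRF constant, not a value tuned in our experiments.

```python
# Minimal Reciprocal Rank Fusion (RRF) sketch: a typical rank-level hybrid baseline.
from collections import defaultdict

def rrf(rankings, k: int = 60):
    """rankings: list of ranked doc-id lists, one per retriever."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # only the rank survives; embeddings are discarded
    return sorted(scores, key=scores.get, reverse=True)
```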
Our Solution: Guided Query Refinement (GQR)
To address this, we introduce Guided Query Refinement (GQR). The key idea is to optimize the query embedding of the primary retriever at test time, guided by the complementary signal in another retriever's scores. Rather than fusing rankings or scores, we refine the representation itself so that it better aligns with cross-modal evidence.
How It Works (a code sketch follows these steps):
- Candidate Pool Creation: Both retrievers retrieve their top-K documents, forming a union pool
- Distribution Alignment: Create probability distributions over candidates from both retrievers
- Iterative Refinement: Optimize the primary query embedding by minimizing KL divergence between the primary distribution and a consensus distribution (average of both retrievers)
- Final Ranking: Use the refined query to re-score documents and produce the final ranking
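A minimal PyTorch sketch of these four steps is shown below. It is an illustration under simplifying assumptions: the hyperparameters (steps, lr, tau) are placeholders, the union candidate pool and its document embeddings are assumed to be already materialized, and the exact optimization details in the paper may differ.

```python
# Minimal GQR sketch (PyTorch). Assumptions, not the paper's exact implementation:
#   q_primary   : (T, d) multi-vector query embedding of the primary retriever
#   pool_docs   : list of (P_i, d) document embeddings for the union candidate pool
#   comp_scores : (N,) complementary retriever's scores over the same pool
import torch
import torch.nn.functional as F

def late_interaction_scores(q, docs):
    """MaxSim score of query q against each candidate document."""
    return torch.stack([(q @ d.T).max(dim=1).values.sum() for d in docs])

def guided_query_refinement(q_primary, pool_docs, comp_scores,
                            steps=20, lr=0.05, tau=1.0):
    q = q_primary.detach().clone().requires_grad_(True)   # refine a copy of the query embedding
    opt = torch.optim.Adam([q], lr=lr)
    p_comp = F.softmax(comp_scores / tau, dim=0)           # complementary distribution (fixed)
    for _ in range(steps):
        log_p_prim = F.log_softmax(late_interaction_scores(q, pool_docs) / tau, dim=0)
        consensus = 0.5 * (log_p_prim.exp().detach() + p_comp)   # average of both retrievers
        loss = F.kl_div(log_p_prim, consensus, reduction="sum")  # KL between consensus and primary
        opt.zero_grad(); loss.backward(); opt.step()
    final = late_interaction_scores(q.detach(), pool_docs)       # re-score with refined query
    return final.argsort(descending=True)                        # final ranking over the pool
```

Only the query embedding and a small candidate pool are touched, so the per-query overhead stays modest (+65 ms over plain ColNomic-7B in the comparison below), in contrast to cross-encoder reranking, which pays a full forward pass per candidate.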
Key Results
14× Faster
ColNomic + GQR matches LLAMA-NEMO's performance while being 14× faster at query time
54× Less Memory
Requires only 0.20 MB per document vs. 10.6 MB for LLAMA-NEMO, enabling practical deployment at scale
+3.9% Relative Gain
Consistent improvements across all model pairs, outperforming traditional hybrid methods
Performance on ViDoRe 2 (NDCG@5)
| Primary Model | Complementary Model | Average | Δ vs Base | GQR |
|---|---|---|---|---|
| ColNomic-7B | – | 60.3 | – | ❌ |
| ColNomic-7B | + Linq-Embed | 62.8 | +2.5 | ✅ |
| ColNomic-7B | + Jina (text) | 63.1 | +2.8 | ✅ |
| Jina (vision) | – | 57.2 | – | ❌ |
| Jina (vision) | + Linq-Embed | 61.2 | +4.0 | ✅ |
| Jina (vision) | + Jina (text) | 60.7 | +3.5 | ✅ |
| Llama-Nemo | – | 63.0 | – | ❌ |
| Llama-Nemo | + Linq-Embed | 65.2 | +2.2 | ✅ |
| Llama-Nemo | + Jina (text) | 64.2 | +1.2 | ✅ |
Pushing the Pareto Frontier
GQR enables ColPali-based models to achieve an optimal trade-off between performance, latency, and memory. While the strongest baseline model (LLAMA-NEMO) achieves NDCG@5 of 62.9 at 2,591ms per query with 10.6 MB per document, our approach allows models with smaller representations to match or exceed this performance with dramatically lower resource requirements:
- ColNomic + GQR (Linq): 62.7 NDCG@5, 181ms (14× faster), 0.20 MB (54× less memory)
- ColNomic + GQR (Jina): 63.0 NDCG@5, 350ms (7× faster), 0.37 MB (28× less memory)
Even when applied to the already-strong LLAMA-NEMO, GQR provides additional gains, pushing performance from 63.0 to 65.2 NDCG@5.
GQR vs Cross-Encoder Reranking
Cross-encoder rerankers are a common approach to improve retrieval quality by applying full query-document attention to top-K candidates. However, they incur substantial computational overhead. We compare GQR against MonoQwen2-VL-v0.1, an open-weights multimodal reranker.
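For contrast with GQR's query-side optimization, a generic top-K reranking loop looks like the sketch below; score_pair is a hypothetical stand-in for a multimodal reranker forward pass (such as MonoQwen2-VL-v0.1 would provide), not an actual API of that model. Its cost grows linearly with K because every candidate requires a full cross-attention pass.

```python
# Generic top-K cross-encoder reranking loop. `score_pair` is a hypothetical placeholder
# for one full query-document forward pass of a multimodal reranker.
def rerank_top_k(query, candidates, score_pair, k=5):
    """candidates: list of (doc_id, first_stage_score) pairs from the base retriever."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    head, tail = ranked[:k], ranked[k:]                  # only the top K get rescored
    rescored = sorted(head, key=lambda c: score_pair(query, c[0]), reverse=True)  # K forward passes
    return [doc_id for doc_id, _ in rescored + tail]     # reranked head, untouched tail
```

Note that reranking only reorders the first-stage top-K, so top-5 reranking cannot improve Recall@5; this is why that column matches the no-reranking row in the table below.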
| Method | Latency (ms) | NDCG@5 | Recall@5 | Latency vs. GQR |
|---|---|---|---|---|
| ColNomic-7B + GQR | 181 | 62.75 | 58.0 | 1× (reference) |
| ColNomic-7B (no reranking) | 116 (−65) | 60.25 (−2.50) | 57.32 (−0.68) | 1.5× faster |
| ColNomic-7B + top-5 reranking | 1,823 (+1,642) | 62.12 (−0.63) | 57.32 (−0.68) | 10× slower |
| ColNomic-7B + top-10 reranking | 3,587 (+3,406) | 64.37 (+1.62) | 59.92 (+1.92) | 20× slower |
| ColNomic-7B + top-20 reranking | 7,036 (+6,855) | 65.07 (+2.32) | 60.27 (+2.27) | 40× slower |
Key Takeaways: GQR provides a compelling alternative to cross-encoder reranking. For ColNomic-7B, GQR achieves higher quality than top-5 reranking (62.75 vs. 62.12 NDCG@5) while being 10× faster. The top-10 and top-20 reranking pipelines do outperform GQR, but at a substantial latency cost of roughly 3.5 and 7 seconds per query; GQR remains about 20× and 40× faster than these pipelines while still improving clearly over the no-reranking baseline.
Key Insights
💡 Weaker Models Provide Value
Text-only models like Qwen3 (NDCG@5: 46.8) can enhance much stronger vision-centric models like LLAMA-NEMO (NDCG@5: 63.0), despite a 16.2-point performance gap. What matters is the complementary signal, not the absolute performance difference.
⚡ Pareto Optimal Trade-offs
GQR achieves a Pareto-optimal balance between retrieval quality and efficiency along both the latency and memory dimensions, enabling models with smaller representations to match much larger ones.
🎯 Competitive with Rerankers, Much Faster
GQR delivers performance competitive with cross-encoder rerankers while being much faster, making it practical for production environments.
🔧 Architecture Agnostic
Works with any combination of single-vector and multi-vector retrievers.
If you've made it this far, we encourage you to dive into our paper for the full details!
BibTeX
@misc{uzan2025guidedqueryrefinementmultimodal,
      title={Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization},
      author={Omri Uzan and Asaf Yehudai and Roi Pony and Eyal Shnarch and Ariel Gera},
      year={2025},
      eprint={2510.05038},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.05038},
}