Guided Query Refinement (GQR): Multimodal Hybrid Retrieval with Test-Time Optimization

Stanford University · IBM Research · The Hebrew University of Jerusalem

The Challenge

ColPali-based multimodal retrievers have achieved impressive results in visual document retrieval by directly embedding page images and scoring textual query tokens against image patches. However, recent attempts to push this paradigm further have relied on massive scaling of query and document representations. The leading* model on the popular ViDoRe leaderboard, llama-nemoretriever-colembed-3b, requires over 10 MB of memory per document page, roughly three orders of magnitude more than single-vector dense retrievers. This overhead creates real obstacles for deployment in production systems. Moreover, purely vision-centric approaches may be constrained by the inherent modality gap exhibited by modern vision-language models.

Memory Requirements Comparison

| Model | Vectors per Page | Token Dim | Floats per Page | Memory per Page | Storage per 1M Docs (GB) | Modality |
|---|---|---|---|---|---|---|
| LINQ-EMBED-MISTRAL | 1 | 4096 | 4,096 | 8 KB | 7.63 | Text |
| QWEN3-EMBEDDING-4B | 1 | 2560 | 2,560 | 5 KB | 4.77 | Text |
| JINA-EMBEDDINGS-V4 (Image) | 767 | 128 | 98,176 | 196 KB | 182.87 | Image |
| COLNOMIC-EMBED-MULTIMODAL-7B | 767 | 128 | 98,176 | 196 KB | 182.87 | Image |
| LLAMA-NEMORETRIEVER-COLEMBED-3B | 1802 | 3072 | 5,535,744 | 10.6 MB | 10,311.13 | Image |

Note: All storage figures assume a 16-bit floating-point representation.
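These figures follow directly from the embedding shapes: floats per page = vectors per page × token dimension, and memory = floats × 2 bytes at 16-bit precision. A quick sanity check in plain Python (our own back-of-the-envelope helper, not code from the paper):

```python
# Back-of-the-envelope check of the table above (16-bit floats = 2 bytes each).
def page_memory_bytes(vectors_per_page: int, token_dim: int, bytes_per_float: int = 2) -> int:
    return vectors_per_page * token_dim * bytes_per_float

# LLAMA-NEMORETRIEVER-COLEMBED-3B: 1802 vectors of dimension 3072 per page.
print(page_memory_bytes(1802, 3072) / 2**20)  # ~10.56 -> "10.6 MB" per page
# LINQ-EMBED-MISTRAL: a single 4096-dimensional vector per page.
print(page_memory_bytes(1, 4096) / 2**10)     # 8.0 -> "8 KB" per page
```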

*As of October 8, 2025.

Our Solution: Guided Query Refinement (GQR)

Can we leverage complementary signals from lightweight text retrievers to enhance these powerful but resource-intensive models?

Traditional hybrid retrieval methods combine retrievers at the level of ranks (e.g., Reciprocal Rank Fusion) or scores (weighted averaging). These approaches are coarse and fail to exploit the rich information within each model's representation space.
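To make "coarse" concrete, here are minimal sketches of both classic baselines (Python; the function names and the weight alpha are illustrative, and k=60 is the constant commonly used with RRF):

```python
# Rank-level fusion (RRF) and score-level fusion, as minimal sketches.

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: a list of ranked doc-id lists, one per retriever."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

def weighted_score_fusion(scores_a, scores_b, alpha=0.5):
    """scores_*: dicts mapping doc_id -> normalized retriever score."""
    pool = set(scores_a) | set(scores_b)
    combined = lambda d: alpha * scores_a.get(d, 0.0) + (1 - alpha) * scores_b.get(d, 0.0)
    return sorted(pool, key=combined, reverse=True)
```

Both operate only on final ranks or scalar scores; neither touches the embeddings themselves, which is precisely the information GQR exploits.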

Figure: GQR algorithm architecture, showing the two-stage process of candidate pool creation and iterative query refinement.

GQR operates at a deeper level: it refines the query representation itself. At test time, GQR iteratively optimizes the primary retriever's query embedding (right side of the figure) with gradient descent, guided by similarity scores from a complementary retriever (left side). The refined query thus softly incorporates the complementary signal while remaining grounded in the primary retriever's representation space.

How It Works:

  1. Candidate Pool Creation: Both retrievers retrieve their top-K documents, forming a union pool
  2. Distribution Alignment: Create probability distributions over candidates from both retrievers
  3. Iterative Refinement: Optimize the primary query embedding by minimizing KL divergence between the primary distribution and a consensus distribution (average of both retrievers)
  4. Final Ranking: Use the refined query to re-score the candidate documents and produce the final ranking (see the code sketch below)
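A minimal single-vector sketch of steps 2–4 in PyTorch. Everything here is illustrative rather than the authors' implementation: the function name, the optimizer, the temperature, and the hyperparameter defaults are our assumptions, and a plain dot product stands in for the MaxSim scoring used by multi-vector models (see the last section below).

```python
import torch
import torch.nn.functional as F

def gqr_refine(q_primary, q_comp, docs_primary, docs_comp,
               steps=20, lr=0.05, temp=1.0):
    """Illustrative GQR loop. Shapes: q_* is (d,), docs_* is (N, d),
    where N is the size of the union candidate pool from step 1."""
    # Step 2a: complementary distribution over the pool (fixed throughout).
    p_comp = F.softmax(docs_comp @ q_comp / temp, dim=0)

    # Step 3: the refined query starts at the original primary embedding.
    q = q_primary.clone().requires_grad_(True)
    opt = torch.optim.Adam([q], lr=lr)

    for _ in range(steps):
        # Step 2b: primary distribution over the same pool.
        p_primary = F.softmax(docs_primary @ q / temp, dim=0)
        # Consensus target: average of both distributions (detached so
        # gradients flow only through the primary side).
        consensus = 0.5 * (p_primary.detach() + p_comp)
        # Minimize KL(primary || consensus) w.r.t. the query embedding.
        loss = (p_primary * (p_primary.log() - consensus.log())).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Step 4: rank the pool with the refined query.
    return q.detach()
```

The final ranking is then simply torch.argsort(docs_primary @ q_refined, descending=True) over the candidate pool. Because optimization starts at the original embedding and runs for only a few steps, the refined query stays close to the primary retriever's space; the complementary retriever only steers it through the consensus target.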

Key Results

14× Faster

ColNomic + GQR matches LLAMA-NEMO's performance while being 14× faster at query time

54× Less Memory

Requires only 0.20 MB per document vs. 10.6 MB for LLAMA-NEMO, enabling practical deployment at scale

+3.9% Relative Gain

Consistent improvements across all model pairs, outperforming traditional hybrid methods

Performance on ViDoRe 2 (NDCG@5)

| Primary Model | Complementary Model | NDCG@5 (avg) | Δ vs Base |
|---|---|---|---|
| ColNomic-7B | – (base) | 60.3 | – |
| ColNomic-7B | Linq-Embed | 62.8 | +2.5 |
| ColNomic-7B | Jina (text) | 63.1 | +2.8 |
| Jina (vision) | – (base) | 57.2 | – |
| Jina (vision) | Linq-Embed | 61.2 | +4.0 |
| Jina (vision) | Jina (text) | 60.7 | +3.5 |
| Llama-Nemo | – (base) | 63.0 | – |
| Llama-Nemo | Linq-Embed | 65.2 | +2.2 |
| Llama-Nemo | Jina (text) | 64.2 | +1.2 |

Pushing the Pareto Frontier

GQR enables ColPali-based models to achieve an optimal trade-off between performance, latency, and memory. While the strongest baseline model (LLAMA-NEMO) achieves NDCG@5 of 62.9 at 2,591 ms per query with 10.6 MB per document, our approach allows models with smaller representations to match or exceed this performance at dramatically lower resource cost:

  • ColNomic + GQR (Linq): 62.7 NDCG@5, 181 ms (14× faster), 0.20 MB (54× less memory)
  • ColNomic + GQR (Jina): 63.0 NDCG@5, 350 ms (7× faster), 0.37 MB (28× less memory)

Even when applied to the already-strong LLAMA-NEMO, GQR provides additional gains, pushing performance from 63.0 to 65.2 NDCG@5.

Figure: Latency–quality trade-off in online querying. The x-axis is per-query runtime in milliseconds (log scale); the y-axis is the average score (NDCG@5). Marker color encodes the base retriever; empty squares show the primary retriever alone (without GQR), while filled markers show GQR, with marker shape encoding the complementary retriever.

GQR vs Cross-Encoder Reranking

Cross-encoder rerankers are a common approach to improve retrieval quality by applying full query-document attention to top-K candidates. However, they incur substantial computational overhead. We compare GQR against MonoQwen2-VL-v0.1, an open-weights multimodal reranker.
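For orientation, top-K cross-encoder reranking reduces to something like the sketch below; score_fn is our placeholder for one forward pass of a multimodal cross-encoder such as MonoQwen2-VL-v0.1, not an API from that model:

```python
def rerank_top_k(query, ranked_docs, score_fn, k=5):
    """Generic top-K cross-encoder reranking sketch.

    score_fn(query, doc) is assumed to run full query-document attention
    and return a relevance score -- one (expensive) forward pass per pair.
    """
    head = sorted(ranked_docs[:k], key=lambda doc: score_fn(query, doc), reverse=True)
    return head + ranked_docs[k:]  # the tail keeps its first-stage order
```

Since each candidate costs one cross-encoder forward pass, latency grows roughly linearly with K, which is exactly the pattern in the table below.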

| Method | Latency (ms) | NDCG@5 | Recall@5 | Latency vs GQR |
|---|---|---|---|---|
| ColNomic-7B + GQR | 181 | 62.75 | 58.00 | – |
| ColNomic-7B (no reranking) | 116 (−65) | 60.25 (−2.50) | 57.32 (−0.68) | 1.5× faster |
| ColNomic-7B + top-5 reranking | 1,823 (+1,642) | 62.12 (−0.63) | 57.32 (−0.68) | 10× slower |
| ColNomic-7B + top-10 reranking | 3,587 (+3,406) | 64.37 (+1.62) | 59.92 (+1.92) | 20× slower |
| ColNomic-7B + top-20 reranking | 7,036 (+6,855) | 65.07 (+2.32) | 60.27 (+2.27) | 40× slower |

Deltas in parentheses are relative to the GQR row.

Key Takeaways: GQR provides a compelling alternative to cross-encoder reranking. For ColNomic-7B, GQR achieves higher performance than top-5 reranking (62.75 vs 62.12 NDCG@5) while being 10× faster. Top-10 and top-20 reranking do outperform GQR, but at a substantial cost of roughly 3.5 and 7 seconds per query; GQR remains 20× and 40× faster than these pipelines, respectively, while still clearly improving over the no-reranking baseline.

Key Insights

💡 Weaker Models Provide Value

Text-only models like Qwen3 (NDCG@5: 46.8) can enhance much stronger vision-centric models like LLAMA-NEMO (NDCG@5: 63.0), despite a 16.2-point performance gap. What matters is the complementarity of the signal, not the weaker model's absolute performance.

⚡ Pareto Optimal Trade-offs

GQR achieves an optimal performance-efficiency balance along both the time and memory dimensions, enabling smaller models to match much larger ones.

🎯 Competitive with Rerankers, Much Faster

GQR delivers performance competitive with cross-encoder rerankers while being much faster, making it practical for production environments.

🔧 Architecture Agnostic

Works with any combination of single-vector and multi-vector retrievers.
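Concretely, for a multi-vector (late-interaction) primary retriever, the only change to the earlier sketch is the scoring function: MaxSim instead of a dot product. A minimal version (our illustrative code, assuming PyTorch):

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """ColBERT/ColPali-style late-interaction scoring.

    query_tokens: (n_q, d) query token embeddings
    doc_tokens:   (n_d, d) document patch/token embeddings
    """
    sim = query_tokens @ doc_tokens.T   # (n_q, n_d) token-level similarities
    return sim.max(dim=1).values.sum()  # best doc match per query token, summed
```

In that case the whole (n_q, d) query matrix, rather than a single vector, would be the variable being refined.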

If you've made it this far, we encourage you to dive into our paper for the full details!


BibTeX

@misc{uzan2025guidedqueryrefinementmultimodal,
  title={Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization},
  author={Omri Uzan and Asaf Yehudai and Roi Pony and Eyal Shnarch and Ariel Gera},
  year={2025},
  eprint={2510.05038},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.05038},
}