Google Scholar

Preprints

Document Optimization for Black-Box Retrieval via Reinforcement Learning
Omri Uzan, Ron Polonsky, Douwe Kiela, Christopher Potts
arXiv

We fine-tune LMs with RL to rewrite documents into better representations for a target black-box retriever, allowing smaller retrievers to match or beat larger retrievers on code and visual document retrieval tasks.

Selected Publications

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera
ICLR-26

GQR refines a primary retriever's query embedding at test time using guidance from a lightweight auxiliary retriever of a different modality, matching larger models while being up to 14x faster and 54x lighter.

CharBench: Evaluating the Role of Tokenization in Character-Level Tasks
Omri Uzan, Yuval Pinter
AAAI-26 (Oral)

We study character-level tasks in LMs and the effect of subword tokenization. We find that tokenization features are not correlated with performance on many character-level tasks, contrary to common perception. We release a benchmark to support future work and reproducibility.

Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter
ACL-24 (Oral, Outstanding Paper Award, Senior Area Chair Award)

We show that for many subword tokenizers, vocabulary construction and tokenization inference are separable components. Using a new intrinsic benchmark, we evaluate popular tokenizers and find that simple greedy inference performs surprisingly well across tokenization algorithms.

Tokenization Is More Than Compression
Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
EMNLP-24 (Oral)

We introduce PathPiece, a tokenizer that minimizes token count, and use it to show empirically that fewer tokens do not necessarily lead to better language models. The results suggest that pre-tokenization and vocabulary construction often matter more than compression rate alone.