Research — Omri Uzan

Preprints

Document Optimization for Black-Box Retrieval via Reinforcement Learning

Omri Uzan, Ron Polonsky, Douwe Kiela, Christopher Potts

arXiv

We fine-tune LMs with RL to rewrite documents into better representations for a target black-box retriever, allowing smaller retrievers to match or beat larger retrievers on code and visual document retrieval tasks.

arXiv Code

Selected Publications

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera

ICLR-26

GQR refines a primary retriever's query embedding at test time using guidance from a lightweight auxiliary retriever of a different modality, matching larger models while being up to 14x faster and 54x lighter.

arXiv Code Website

CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Omri Uzan, Yuval Pinter

AAAI-26 (Oral)

We study character-level tasks in LMs and how subword tokenization affects them. Contrary to common perception, we find that tokenization features are not correlated with performance on many character-level tasks. We release a benchmark for future work and reproduction.

arXiv Benchmark

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

ACL-24 (Oral, Outstanding Paper Award, Senior Area Chair Award)

We show that for many subword tokenizers, vocabulary construction and tokenization inference are separable components. Using a new intrinsic benchmark, we evaluate popular tokenizers and find that simple greedy inference performs surprisingly well across tokenization algorithms.

Paper Code

Tokenization Is More Than Compression

Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

EMNLP-24 (Oral)

We introduce PathPiece, a tokenizer that minimizes token count, and use it to show empirically that fewer tokens do not necessarily lead to better language models. The results suggest that pre-tokenization and vocabulary construction often matter more than compression rate alone.

Paper