Under Review · 2026

Beyond Scalar Distances:
Semantic Attribute Gradients from Frozen MLLMs

Turning a frozen multimodal LLM into a training-time supervisor that teaches a vision encoder the fine-grained attributes that matter for zero-shot retrieval.

1 University of Illinois at Urbana-Champaign  ·  * Equal contribution

[Figure: SAGA teaser comparing class-label scalar supervision with MLLM attribute-resolved supervision for an Indigo Bunting and a Blue Grosbeak.]
Class labels reduce supervision to a scalar; an MLLM resolves it into attributes. An Indigo Bunting and a Blue Grosbeak share deep-blue plumage and gray legs and differ only in their wing pattern. A class-label loss collapses this to a single "different," pushing every embedding dimension apart. A frozen MLLM instead names which attributes match and which differ before reaching its verdict — and SAGA turns that into a learning signal for the encoder.

Abstract

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves zero-shot image-retrieval Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves.

TL;DR

SAGA replaces the single scalar of class-label metric learning with attribute-resolved supervision distilled from a frozen MLLM, training a vision encoder whose embeddings capture the fine-grained attributes that matter for zero-shot retrieval. The MLLM is discarded at inference, so deployment costs nothing extra.

Method

The vision encoder emits patch tokens that a frozen MLLM compares pairwise, describing attributes and judging same/different class. Three signals shape the encoder and pooler:

ℒ_GRPO

Correct same/different verdicts are rewarded via GRPO. The policy gradient flows back through the frozen backbone, pushing the encoder to expose the discriminative attributes the MLLM relied on.
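As a rough illustration of the group-relative part of GRPO: several verdicts are sampled for the same image pair, each receives a reward, and rewards are normalized within that group so that correct verdicts end up with positive advantage. The sketch below assumes a binary reward (1 for a correct verdict, 0 otherwise) and a hypothetical group of four samples; the function name and `eps` are illustrative and not taken from the paper.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled verdicts (GRPO-style)."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation of the rewards in this group.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # Above-average rewards get positive advantage, below-average negative.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled verdicts for one image pair: 1.0 = correct, 0.0 = wrong.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
```

If every sampled verdict in a group earns the same reward, all advantages are zero and that pair contributes no policy-gradient signal.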

ℒ_KL

An attention-alignment loss distills the MLLM's spatial attention into a lightweight pooler, so it weights attribute-relevant regions rather than spurious cues.
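One way to picture the attention-alignment term is as a KL divergence between two distributions over patch positions: the MLLM's attention (teacher) and the pooler's weights (student). The sketch below is a minimal version under stated assumptions: the KL direction (teacher to student), the softmax parametrization of the pooler, and all names are hypothetical choices, not details given here.

```python
import math


def softmax(scores):
    """Turn raw pooler scores into a distribution over patches."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_kl(teacher_attn, pooler_logits, eps=1e-12):
    """KL(teacher || student) over patch positions.

    teacher_attn: non-negative MLLM attention mass per patch (assumed given).
    pooler_logits: unnormalized pooler scores per patch.
    """
    student = softmax(pooler_logits)
    z = sum(teacher_attn)
    teacher = [t / z for t in teacher_attn]  # renormalize to sum to 1
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0  # patches with zero teacher mass contribute nothing
    )


# A pooler that weights all three patches equally vs. a peaked teacher.
loss = attention_kl([0.7, 0.2, 0.1], [0.0, 0.0, 0.0])
```

The loss is zero when the pooler's weights match the teacher's attention and grows as the pooler puts mass on patches the MLLM ignored.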

ℒ_DML

A standard deep-metric-learning loss shapes the pooled embedding geometry for nearest-neighbour retrieval.
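The specific metric-learning loss is not named here. As one common example of such a loss, the sketch below uses a triplet margin loss on cosine distance; the choice of triplet loss, the margin value, and the function names are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on cosine distances: same-class pair should be closer by `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)  # distance to a same-class image
    d_neg = 1.0 - cosine(anchor, negative)  # distance to a different-class image
    return max(0.0, d_pos - d_neg + margin)


# Well-separated triplet: the positive coincides with the anchor.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Violating triplet: the negative coincides with the anchor.
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Only the violating triplet produces a non-zero loss, which is what shapes the embedding geometry for nearest-neighbour search.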

[Figure: SAGA training pipeline. Vision encoder tokens are fed to a frozen MLLM and rewarded with GRPO, alongside attention-alignment and deep-metric-learning losses.]
Training. Image pairs are encoded into patch tokens and handed to the frozen MLLM, which describes attributes and outputs a verdict. Correct verdicts (r = 1) earn positive advantage under GRPO; the gradient reinforces the encoder directions that carry the deciding attributes. In parallel, the MLLM's attention supervises the retrieval pooler (ℒ_KL) and a metric loss (ℒ_DML) organizes the embedding space.
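The three signals can be read as terms of a single training objective. A plausible form, assuming a simple weighted sum (the weights λ_KL and λ_DML are hypothetical and not stated here), is:

```latex
\mathcal{L}
  = \mathcal{L}_{\mathrm{GRPO}}
  + \lambda_{\mathrm{KL}} \, \mathcal{L}_{\mathrm{KL}}
  + \lambda_{\mathrm{DML}} \, \mathcal{L}_{\mathrm{DML}}
```

Under this reading, only the vision encoder and pooler receive parameter updates, since the MLLM is frozen.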
[Figure: SAGA inference pipeline. An image passes through the vision encoder and pooler to an embedding, then cosine similarity gives top-K retrieval.]
Inference. The MLLM is discarded. Only the trained vision encoder and pooler remain — an image maps to a single embedding and retrieval is plain cosine nearest-neighbour search, matching the deployment cost of any standard metric-learning pipeline.
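Cosine nearest-neighbour retrieval needs nothing beyond the embeddings themselves. The sketch below is a brute-force version on toy 2-D vectors; the function names and example data are illustrative, and a real gallery would typically use a vectorized or approximate search instead.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query, gallery, k):
    """Return indices of the k gallery embeddings most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, emb in enumerate(gallery):
        g = l2_normalize(emb)
        similarity = sum(a * b for a, b in zip(q, g))
        scored.append((similarity, idx))
    # Highest similarity first; ties broken by lower gallery index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = top_k([1.0, 0.0], gallery, k=2)  # indices of the two closest items
```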

Results

Zero-shot image retrieval on four fine-grained benchmarks. All methods share the same Qwen3-VL-8B vision tower; baselines use mean pooling. SAGA improves Recall@1 by 3–6 points over the strongest prior baselines.

+6.3 points: Recall@1 on CUB-200 over the best baseline
87.9%: Recall@1 on CUB-200 (97.0% on Cars-196)
Zero extra inference cost: the MLLM is discarded
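| Method | CUB-200 R@1 | CUB-200 R@4 | CUB-200 NMI | Cars-196 R@1 | Cars-196 R@4 | Cars-196 NMI | Aircraft R@1 | Aircraft R@4 | Aircraft NMI | iNat-Aves R@1 | iNat-Aves R@4 | iNat-Aves NMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pre-trained backbone | 75.6 | 91.8 | 0.77 | 70.7 | 88.5 | 0.49 | 53.1 | 76.1 | 0.43 | 42.2 | 64.8 | 0.65 |
| Proxy Anchor | 79.5 | 92.0 | 0.79 | 93.4 | 97.3 | 0.84 | 73.1 | 92.8 | 0.68 | 54.1 | 73.1 | 0.72 |
| Potential Field | 81.6 | 92.9 | 0.81 | 93.7 | 97.8 | 0.86 | 77.4 | 93.2 | 0.72 | 55.6 | 75.0 | 0.73 |
| SAGA (ours) | 87.9 | 96.3 | 0.83 | 97.0 | 98.6 | 0.89 | 83.5 | 93.9 | 0.77 | 60.1 | 77.1 | 0.80 |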

R@1 and R@4 are Recall@1 and Recall@4 (%); NMI is Normalized Mutual Information (∈ [0,1]). iNat-Aves is our bird subset of iNaturalist-2021. SAGA values are means over 3 seeds. The SAGA row is best in every column.
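For readers unfamiliar with the metric, Recall@K in retrieval benchmarks is commonly computed as the fraction of queries whose K nearest neighbours include at least one image of the same class, with the query itself excluded. The sketch below follows that common protocol on toy data; whether the evaluation here matches it in every detail is an assumption, and the names and data are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with a same-class item among their k nearest neighbours."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Score every other item; the query is excluded from its own results.
        scored = [
            (cosine(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        if any(labels[j] == labels[i] for _, j in scored[:k]):
            hits += 1
    return hits / len(embeddings)


# Toy set: the last item has class 0 but sits among the class-1 embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = [0, 0, 1, 1, 0]
r_at_1 = recall_at_k(embeddings, labels, k=1)
r_at_3 = recall_at_k(embeddings, labels, k=3)
```

In this toy set the misplaced item misses at K = 1 and only finds a same-class neighbour at K = 3, which is why Recall@K can only stay the same or rise as K grows.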

BibTeX

If you find our work useful, please consider citing:

@article{bhatnagar2026saga,
    title   = {Beyond Scalar Distances: Semantic Attribute Gradients
               from Frozen MLLMs for Visual Embeddings},
    author  = {Bhatnagar, Shubhang and Baiju, Dheeraj and Ahuja, Narendra},
    journal = {arXiv preprint arXiv:2606.15134},
    year    = {2026}
}