Self-Aug: Teaching Vision Models to Pick Their Own Medicine

Large language models hallucinate. Ask one a question and it will produce a fluent, confident, yet factually wrong answer — a side effect of being trained to predict likely next tokens rather than to be correct.

Extend an LLM with a vision encoder. You get a Large Vision-Language Model (LVLM): same autoregressive language model at the core, but with image tokens projected into its input alongside text tokens. The model can now describe pictures, answer questions about them, read charts. However, now it can also hallucinate about them. Ask it what color the coat is, and it might tell you "Black" with the same confidence whether the coat is red, blue, or not in the picture at all. The visual grounding is shakier than it looks, because the underlying language model still wants to produce plausible-sounding text whether the image actually supports it.

One promising fix happens at decoding time, and it's called Contrastive Decoding. The intuition is simple. Imagine asking two people the same question: one who can see the image clearly, and one who is looking at a degraded version of it. Whatever they both say is probably coming from general world knowledge or language patterns, not from actually looking at the picture. Whatever the clear-eyed one says that the other one doesn't is the part of the answer that genuinely depends on visual evidence. Contrastive Decoding does exactly this with a single model: run it twice, once on the real image and once on a corrupted version, and subtract the second output distribution from the first. What's left is the visually grounded signal.

Why Generic Corruptions Fall Short

Most contrastive decoding methods for LVLMs corrupt the image with something generic. Add Gaussian noise. Crop a random patch. Earlier methods used diffusion-style noise to degrade the image. The reasoning is that any reasonable corruption should weaken the model's grasp of the image, which lets the contrast surface the visually grounded part of the answer.

The problem is that not all corruptions disrupt all questions. If I ask, "what color is the coat?", adding noise might blur the texture but leave the color roughly intact, so the corrupted run answers almost the same as the clean run, and the contrast yields nothing. The right corruption for that question is color inversion: flip every pixel to its complement and the model genuinely cannot answer. Now the contrast is informative, because the two runs disagree precisely on the dimension the question is asking about.

This generalizes. A question about left-versus-right positioning needs a horizontal flip. A counting question needs a random mask that can hide objects. A question about reading text needs noise that smudges fine detail. The corruption should target the visual property the question is probing, and a generic perturbation will not.

So, the question is: how to pick the corruption? Earlier work has tried various heuristics, but the most natural answer is also the most ambitious. Let the model itself pick.

Letting the Model Pick the Corruption

Self-Aug architecture diagram showing Self-Augmentation Selection Prompting and Sparsity Adaptive Truncation

Self-Aug has two pieces:

The first, Self-Augmentation Selection (SAS), is a prompting trick. The model is given a meta-prompt listing six possible image corruptions with one-line descriptions of what each disrupts: color inversion breaks color identification, random mask breaks counting and presence, and so on. It then reads the actual user question, picks the corruption that would most invalidate the question's premise, and that choice gets applied to the image to produce the contrast pair.

What makes this work is that the model already knows what color inversion does to a scene, or what a horizontal flip means for left/right reasoning, because language training has given it that knowledge.

A Confidence-Aware Filter

The second piece, Sparsity Adaptive Truncation (SAT), fixes a quieter issue with how contrastive decoding filters its candidates. Earlier methods used a fixed threshold to discard low-probability tokens before applying the contrast. SAT makes that threshold respond to the model's confidence: when the output distribution is sharp (the model knows what it wants to say), the threshold tightens to keep only top candidates; when the distribution is spread out (the model is genuinely uncertain), the threshold loosens so that plausible alternatives are not thrown away.

Self-Aug results showing corruption selection and threshold behavior across different query types

The figure above shows both pieces in action. On the left, a question about whether the wind is blowing a flag where both the clean run and the corrupted run rank "No" as their top answer, but the contrast flips the final output to the correct "Yes." On the right, a description task where the corrupted run wants to say the dog is "blue" (a hallucination triggered by the noise corruption itself), and SAT's threshold is tight enough at that token position to filter the bad candidate out entirely.

Does It Work?

The paper evaluates Self-Aug on five LVLMs (LLaVA-1.5 in 7B and 13B, Qwen-VL, InstructBLIP, Qwen3-VL-8B) across seven benchmarks covering both discriminative tasks (yes/no and multiple choice) and open-ended generation. Self-Aug consistently improves factual consistency over multinomial sampling and prior contrastive decoding methods, with relative gains as large as 18.78% on InstructBLIP. The improvements hold across model scales and across benchmark types.

Where It Breaks

Several limitations are worth flagging. The corruption set is fixed at six predefined operations (flips, crops, masks, noise, color inversion), which covers common visual properties but cannot target more specialized ones. A question about fine-grained texture or specific spatial geometry might not have a good match in the toolbox.

The selection step also depends on the base model's ability to reason over the meta-prompt: a less capable model may pick a poor corruption, and the paper notes that the 7B and 13B LLaVA models only agree with a GPT-4o-mini "oracle" on the right corruption around 64–66% of the time.

The deepest issue, though, is in the assumption underneath all contrastive decoding. The method assumes the corrupted run is worse than the clean run, so subtracting it sharpens the answer toward the visually grounded signal. But on Qwen3-VL-32B, the strongest model evaluated, the corrupted run sometimes scores higher than the clean run, and the contrast then drags the final answer in the wrong direction. The implication is significant: contrastive decoding's effectiveness depends on a capability gap between the two runs, and as base models get stronger, generic image corruptions may not be enough to produce that gap.

ℹ️

Contrastive Decoding Lineage

Self-Aug sits at the end of a clear lineage: from text-only Contrastive Decoding (Li et al.), to Visual Contrastive Decoding with fixed noise (VCD), to query-aware corruption selection (VACoDe), to Self-Aug's model-driven selection. Each step makes the "amateur" signal more targeted.

Contrastive Decoding (Li et al.) — Started the lineage in pure NLP. Contrasts a strong "expert" model against a weaker "amateur" model. Self-Aug inherits this mechanism but reinterprets the amateur as the model running on a corrupted image.
Visual Contrastive Decoding (VCD) — Brings CD into the LVLM setting. Constructs the amateur run by adding diffusion-style Gaussian noise. Self-Aug argues that a single fixed corruption is too blunt — different questions need different degradations.
VACoDe — Self-Aug's predecessor. Runs every candidate corruption through the model and picks the one producing the largest logit divergence. Costs multiple forward passes. Self-Aug avoids this by getting the choice from a single text-only prompt.
MARINE — Opposite direction: instead of exposing and subtracting the language prior, it strengthens the visual signal by running an external object detector and injecting detected objects as grounding information via classifier-free guidance.
Activation Steering Decoding (ASD) — Different angle entirely. Identifies hallucination-associated directions in the model's hidden state space and steers activations directly, rather than corrupting inputs. Where Self-Aug perturbs what the model sees, ASD perturbs what the model thinks.

This is a literature survey written for the UCLA Natural Language Processing course (Spring 2026).

Why Generic Corruptions Fall Short

Letting the Model Pick the Corruption

A Confidence-Aware Filter

Does It Work?

Where It Breaks

Related Work