Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking

Rheeya Uppaal1, Phu Mon Htut2, Min Bai2, Nikolaos Pappas2, Zheng Qi2, Sandesh Swamy2
1University of Wisconsin-Madison 2AWS AI Labs

Today’s reasoning-augmented Vision–Language Models (VLMs) don’t just give answers — they show their work. But what if that “work” is wrong… even when the final answer is right?
Figure 1. Reasoning-chain faithfulness does not always align with final-answer correctness. (a–b) Visually unfaithful reasoning chains that nonetheless yield correct answers on perception tasks. (c) A visually faithful chain producing an incorrect answer, where the error arises from reasoning rather than perception.

Reasoning-augmented vision–language models (VLMs) promise transparency by revealing not just their answers, but the steps used to reach them. Yet this transparency can be deceptive. As shown in Figure 1, a model may produce the correct final answer while its intermediate reasoning invents visual details that never appeared in the image: a lack of visual faithfulness. Conversely, a model may faithfully describe what it sees, yet still arrive at the wrong conclusion due purely to errors in its logic.

This reveals a deeper issue: current evaluations of VLMs test how well a model "sees" an image by measuring final-answer accuracy on perception-based questions. But as Figure 2 highlights, this can be misleading: final accuracy and reasoning faithfulness diverge sharply, and in many examples the model appears to “solve” the question without relying on its own stated reasoning.

Figure 2. Reasoning faithfulness and final-answer accuracy diverge. Correct final answers are not always grounded in the image, and incorrect answers can still reflect visually faithful reasoning. Evaluating only final accuracy overlooks whether the reasoning process itself attends to the visual evidence. This weak correspondence between final-answer correctness and reasoning-chain faithfulness shows that accuracy metrics alone cannot capture whether a model’s reasoning genuinely reflects what it “sees.”
Figure 3. Causal structure underlying final answers and reasoning traces. Many evaluation protocols assume the final answer y is produced via the reasoning chain R (orange arrows). However, models can also map hidden features h directly to y (red arrow) via spurious correlations or language priors, bypassing R. Thus, high final-answer accuracy does not guarantee that intermediate reasoning steps are visually faithful.
Because this disconnect between accuracy and visual grounding reveals a genuinely new problem, we introduce both a metric to measure visual faithfulness in reasoning chains and a mitigation strategy to improve it.

Measuring Visual Faithfulness in Reasoning Chains

Rather than relying on specialized detectors or handcrafted rules that may generalize poorly, we first test a straightforward approach: using an off-the-shelf VLM as a judge and evaluating each step independently.

The judge first breaks the model's answer into steps, then labels each step as a Perception step (describing visual content) or a Reasoning step (drawing inferences). It then evaluates the visual faithfulness of every Perception step. An example is illustrated in Figure 4.

Figure 4. Evaluation of Reasoning Chain Visual Faithfulness through a Judge. Left: Input provided to the judge model. Right: Output of the judge.
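
To make this concrete, here is a minimal sketch of what such a judging pipeline could look like in code. This is not our exact implementation: the `call_judge_vlm` helper, the instruction wording, and the JSON output schema are illustrative assumptions.

```python
import json

# Hypothetical helper: sends an image plus a text prompt to an off-the-shelf
# judge VLM (e.g. through a hosted chat API) and returns its raw text response.
def call_judge_vlm(image_bytes: bytes, prompt: str) -> str:
    raise NotImplementedError("wire this up to your VLM provider of choice")

JUDGE_INSTRUCTIONS = (
    "You are given an image and a model's reasoning chain.\n"
    "1. Split the reasoning chain into individual steps.\n"
    "2. Label each step as PERCEPTION (describes visual content) or "
    "REASONING (draws an inference from earlier steps).\n"
    "3. For every PERCEPTION step, state whether it is faithful to the image.\n"
    'Respond with a JSON list of objects: {"step": ..., "type": ..., "faithful": true/false/null}.'
)

def judge_reasoning_chain(image_bytes: bytes, chain: str) -> dict:
    """Judge one reasoning chain and compute its unfaithful perception rate (UPR)."""
    prompt = JUDGE_INSTRUCTIONS + "\n\nReasoning chain:\n" + chain
    steps = json.loads(call_judge_vlm(image_bytes, prompt))  # assumes schema-compliant output
    perception = [s for s in steps if s["type"] == "PERCEPTION"]
    unfaithful = [s for s in perception if s["faithful"] is False]
    return {
        "steps": steps,
        "unfaithful_perception_rate": len(unfaithful) / max(len(perception), 1),
    }
```

In practice the judge's output may need light post-processing before `json.loads`, and a robust version would also handle refusals and malformed responses.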

This design aligns closely with human annotation, and our human correlation study shows that a strong VLM judge, such as Claude 4 Sonnet, tracks human ratings with high reliability.

Table 1. Comparison of various judge models on the task of measuring visual faithfulness. The labels of each judge are compared against two sets of human annotations, using ICC(3,1) as a correlation measure. Correlations above 0.6 are considered acceptable.
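
For readers who want to reproduce this kind of agreement analysis, the sketch below shows one standard way to compute ICC(3,1) (two-way mixed effects, consistency, single rater) from an items-by-raters matrix of faithfulness labels. The toy data at the end is purely illustrative.

```python
import numpy as np

def icc_3_1(ratings: np.ndarray) -> float:
    """ICC(3,1) from a ratings matrix of shape (n_items, n_raters), no missing values."""
    n, k = ratings.shape
    grand = ratings.mean()
    item_means = ratings.mean(axis=1)   # one mean per item (row)
    rater_means = ratings.mean(axis=0)  # one mean per rater (column)

    ss_total = ((ratings - grand) ** 2).sum()
    ss_items = k * ((item_means - grand) ** 2).sum()
    ss_raters = n * ((rater_means - grand) ** 2).sum()
    ss_error = ss_total - ss_items - ss_raters

    ms_items = ss_items / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_items - ms_error) / (ms_items + (k - 1) * ms_error)

# Toy example: binary faithfulness labels from one judge and two human annotators.
labels = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
])
print(round(icc_3_1(labels), 3))
```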

How does this compare to using the answer-generating model itself? An external judge is far more reliable than having the generating model assess its own reasoning (more details in our paper). Models are notoriously overconfident about their own outputs and often reproduce their own hallucinations instead of flagging them. External judging provides the objectivity needed for evaluation at scale.

Simply prompting a VLM judge to evaluate visual faithfulness in reasoning chains is powerful: it shows high correlation with human judgement.

Improving Faithfulness: Knowing When and How to Intervene

Once we can detect unfaithful perception steps, the next question is how to fix them. We frame this as a when + how problem.

Why this framing? The reasoning chains produced by modern VLMs are long and interleave perception steps with logical reasoning steps. Intervening everywhere risks disrupting the purely logical steps (they are not meant to look at the image, and forcing the model to reference the image at that point may hurt its reasoning), while intervening too rarely leaves hallucinations untouched. The key insight is that interventions should occur only when an unfaithful perception step is detected, so the model is corrected precisely at the point where its grounding fails.

The “when” is a simple off-the-shelf VLM detector. We use Claude 3.7 Sonnet and show that simply prompting this model is significantly stronger than using trained detectors that utilize the generating VLM's internal states.

The “how” is a lightweight self-reflection mechanism. When the detector identifies an unfaithful step, the model is prompted to regenerate only that step with explicit instructions to examine the image more carefully. If the revision is faithful, the reasoning chain generation continues onward from the corrected point; if not, the regeneration is repeated up to a small retry limit. This keeps the corrections local, preserving most of the chain while improving grounding exactly where needed.
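
Putting the “when” and the “how” together, the sketch below shows one way such a detect-then-reflect loop could be wired up. The helpers `generate_next_step` and `is_unfaithful_perception`, the retry budget, and the corrective hint text are hypothetical stand-ins rather than our exact prompts or APIs.

```python
# Hypothetical helpers, to be wired to real models; not our exact prompts or APIs.
def generate_next_step(image, question, chain, hint=None):
    """Ask the reasoning VLM for its next step (returns None when the chain is complete).
    `hint` carries an optional corrective instruction used during regeneration."""
    raise NotImplementedError

def is_unfaithful_perception(image, step) -> bool:
    """Prompt the detector VLM (e.g. Claude 3.7 Sonnet) to flag a perception step
    that contradicts the image."""
    raise NotImplementedError

MAX_RETRIES = 2  # small retry budget for regenerating a flagged step

def generate_with_reflection(image, question):
    """Generate a reasoning chain, locally correcting unfaithful perception steps."""
    chain = []
    while True:
        step = generate_next_step(image, question, chain)
        if step is None:  # the model has finished its reasoning chain
            break
        retries = 0
        while is_unfaithful_perception(image, step) and retries < MAX_RETRIES:
            # Regenerate only this step, instructing the model to look at the image again.
            step = generate_next_step(
                image, question, chain,
                hint="Re-examine the image carefully and restate this step "
                     "using only details that are actually visible.",
            )
            retries += 1
        chain.append(step)  # keep the (possibly corrected) step and continue
    return chain
```

Because only the flagged step is regenerated, the rest of the chain is left untouched, which is what keeps the correction local and cheap.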

Figure 5. Impact of our method on the visual faithfulness of reasoning chains. Both methods (vanilla and ours) use the same underlying model. Our method significantly reduces the unfaithful perception step rate (left), while also improving final answer accuracy (right).

The results in Figure 5 show large improvements in visual grounding, measured as a lower Unfaithful Perception Rate (UPR), across multiple benchmarks and datasets. Interestingly, final-answer accuracy also improves in many cases: strengthening intermediate perception helps the model reason better overall, indicating that faithful grounding is not just about interpretability; it also benefits task performance.

When the model makes a visually unfaithful claim, we catch it with a VLM judge and ask the model to rethink just that step. It works surprisingly well!

TL;DR

Overall, we see this work as a first step: the main goal was to clearly surface the problem and offer an initial toolkit, and we hope it encourages the community to explore stronger, more practical ways to make multimodal reasoning truly faithful.

Thanks for reading!