Can VLMs Review Mobile UI? Introducing Scry Design Diff Eval

July 3, 2026

Generated UI is everywhere, and someone has to check it. When a coding agent implements a screen from a design reference, the question is not "do these two images differ?" — they always differ — but "what changed, where, and does a developer need to care?" Pixel-diff tools drown you in harmless rendering noise while missing a swapped icon or a wrong selected state. Vision-language models should be good at this. We built a benchmark to find out whether they are.

The benchmark

Scry Design Diff Eval contains 311 human-reviewed mobile UI pairs: a reference screenshot and a generated implementation of the same screen. 234 pairs carry 557 human-tagged defects — each one a defect tag (Icon/Nav, Typography, Missing Content, ...) plus a hand-drawn selection box — and 77 pairs with no tagged issues serve as controls. A model must return a structured issue list: tags and normalized boxes, not prose.
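To make "tags and normalized boxes" concrete, here is a minimal sketch of what one entry in a structured issue list could look like. The field names (`tags`, `box`, `image`), the corner-coordinate box layout, and the `parse_issue` helper are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass

# Hypothetical schema: the benchmark's real field names and box layout may differ.
VALID_IMAGES = ("reference", "generated")


@dataclass(frozen=True)
class Box:
    """Axis-aligned box with corners normalized to [0, 1] (assumed layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def __post_init__(self):
        coords = (self.x_min, self.y_min, self.x_max, self.y_max)
        if not all(0.0 <= c <= 1.0 for c in coords):
            raise ValueError("coordinates must be normalized to [0, 1]")
        if self.x_min >= self.x_max or self.y_min >= self.y_max:
            raise ValueError("box must have positive width and height")


@dataclass(frozen=True)
class Issue:
    """One reported defect: one or more tags, a box, and the image it sits on."""
    tags: frozenset
    box: Box
    image: str

    def __post_init__(self):
        if not self.tags:
            raise ValueError("an issue needs at least one tag")
        if self.image not in VALID_IMAGES:
            raise ValueError(f"image must be one of {VALID_IMAGES}")


def parse_issue(raw: dict) -> Issue:
    """Validate one raw model output entry, rejecting malformed boxes early."""
    return Issue(tags=frozenset(raw["tags"]), box=Box(*raw["box"]), image=raw["image"])


issue = parse_issue(
    {"tags": ["Typography"], "box": [0.10, 0.20, 0.50, 0.40], "image": "generated"}
)
print(issue)
```

Validating at parse time keeps malformed output (boxes in pixel units, inverted corners, missing tags) out of the scorer instead of letting it silently score as a miss.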

Scoring is deterministic and recall-first. A model issue counts only if it shares a tag with a human issue and its box overlaps that human issue's box (IoU ≥ 0.10) on the same image, matched one-to-one. Human annotations are known positives, not exhaustive ground truth, so extra model findings are reported as diagnostics rather than penalties.
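The matching rule can be sketched as follows. The text specifies the three conditions (shared tag, IoU ≥ 0.10, same image) and that matching is one-to-one, but not how ties between competing candidates are resolved; the greedy highest-IoU-first assignment below is an assumption, as are the dictionary keys and function names.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_issues(human, model, min_iou=0.10):
    """Return (human_index, model_index) pairs, each index used at most once.

    Assumption: candidates are assigned greedily, highest IoU first, with
    index order as a deterministic tie-break.
    """
    candidates = []
    for hi, h in enumerate(human):
        for mi, m in enumerate(model):
            if h["image"] != m["image"]:
                continue  # boxes on different images never match
            if not set(h["tags"]) & set(m["tags"]):
                continue  # at least one shared tag is required
            score = iou(h["box"], m["box"])
            if score >= min_iou:
                candidates.append((score, hi, mi))

    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_human, used_model, pairs = set(), set(), []
    for _, hi, mi in candidates:
        if hi in used_human or mi in used_model:
            continue
        used_human.add(hi)
        used_model.add(mi)
        pairs.append((hi, mi))
    return pairs


human = [
    {"tags": ["Typography"], "box": (0.10, 0.10, 0.50, 0.30), "image": "generated"},
    {"tags": ["Icon/Nav"], "box": (0.60, 0.80, 0.90, 0.95), "image": "generated"},
]
model = [
    {"tags": ["Typography"], "box": (0.10, 0.10, 0.50, 0.30), "image": "generated"},
    {"tags": ["Typography"], "box": (0.12, 0.10, 0.50, 0.30), "image": "generated"},
    {"tags": ["Missing Content"], "box": (0.60, 0.80, 0.90, 0.95), "image": "generated"},
]
pairs = match_issues(human, model)
recall = len(pairs) / len(human)
print(pairs, recall)
```

In this toy case the second Typography prediction is a near-duplicate and stays unmatched because the human issue is already taken, while the third prediction has the right box but the wrong tag. Both would show up as extra findings in the diagnostics rather than as recall.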

Results

We evaluated six model endpoints on the full 311-pair set:

Model                                Known-Issue Recall   Diagnostic Precision   Issue F1
Kimi K2.7 Code + Together recovery   38.2%                15.9%                  22.5%
Gemini 3.5 Flash                     37.5%                20.2%                  26.3%
Codex GPT-5.5 xhigh                  37.3%                13.6%                  19.9%
MiniMax M3                           21.9%                11.5%                  15.0%
Gemma 4 26B A4B                      20.3%                12.4%                  15.4%
Gemma 4 31B                          17.8%                13.3%                  15.2%
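As a reading aid for the table, the sketch below recomputes F1 from the recall and precision columns, assuming Issue F1 is the usual harmonic mean of the two, and estimates how many of the 557 known issues each recall figure corresponds to. The matched counts are approximations derived from rounded percentages, not figures reported by the benchmark.

```python
KNOWN_ISSUES = 557  # human-tagged defects across the 234 defective pairs

# (recall %, precision %, reported F1 %) copied from the results table
rows = {
    "Kimi K2.7 Code + Together recovery": (38.2, 15.9, 22.5),
    "Gemini 3.5 Flash": (37.5, 20.2, 26.3),
    "Codex GPT-5.5 xhigh": (37.3, 13.6, 19.9),
    "MiniMax M3": (21.9, 11.5, 15.0),
    "Gemma 4 26B A4B": (20.3, 12.4, 15.4),
    "Gemma 4 31B": (17.8, 13.3, 15.2),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of Issue F1)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def approx_matched(recall_pct, known=KNOWN_ISSUES):
    """Estimate matched issues from a rounded recall percentage."""
    return round(recall_pct / 100 * known)


for name, (recall, precision, reported) in rows.items():
    print(f"{name}: F1 ~ {f1(precision, recall):.1f}% (reported {reported}%), "
          f"~{approx_matched(recall)} of {KNOWN_ISSUES} issues matched")
```

Under that assumption the recomputed values agree with the reported F1 column to within about 0.1 percentage points, with the small differences attributable to the inputs already being rounded.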

Four things stood out:

  • The best models catch roughly two fifths of known issues. The three leaders sit within five matched issues of each other — but Gemini 3.5 Flash gets there with hundreds fewer predictions, giving it the best precision and F1 of the high-recall group.
  • Noticing a diff is easy; enumerating defects is hard. The leaders find something on 62–66% of defective screens, but issue-level recall is ~38% — on dense screens most annotated defects go unrecovered.
  • Defect families differ wildly. Changed illustrations (Image/Asset, 76.7%) and Missing Content (69.1%) are the easiest; Typography tops out at 20.0%.
  • Abstention is basically absent. The high-recall models flag 96–100% of the control pairs, which have no tagged issues. A review tool that flags every screen still needs a human triage pass.
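The gap between screen-level and issue-level numbers in the list above is easier to see with a small worked example. The sketch below computes three aggregates from per-pair counts: issue-level recall, the share of defective screens with at least one matched issue, and the share of controls that received any prediction. The per-pair record layout, the toy numbers, and the reading of "find something" as "at least one matched issue" are all assumptions made for illustration.

```python
def summarize(pairs):
    """Aggregate per-pair counts into issue-level, screen-level, and control metrics."""
    defective = [p for p in pairs if p["n_human"] > 0]
    controls = [p for p in pairs if p["n_human"] == 0]

    total_human = sum(p["n_human"] for p in defective)
    total_matched = sum(p["n_matched"] for p in defective)
    screens_hit = sum(1 for p in defective if p["n_matched"] > 0)
    controls_flagged = sum(1 for p in controls if p["n_predicted"] > 0)

    return {
        # fraction of individual known issues that were recovered
        "issue_recall": total_matched / total_human if total_human else 0.0,
        # fraction of defective screens with at least one recovered issue
        "screen_hit_rate": screens_hit / len(defective) if defective else 0.0,
        # fraction of no-issue controls that were flagged anyway
        "control_flag_rate": controls_flagged / len(controls) if controls else 0.0,
    }


# Toy data, not benchmark results: one dense screen dominates the issue count.
toy_pairs = [
    {"n_human": 5, "n_matched": 1, "n_predicted": 8},
    {"n_human": 1, "n_matched": 1, "n_predicted": 3},
    {"n_human": 4, "n_matched": 0, "n_predicted": 6},
    {"n_human": 0, "n_matched": 0, "n_predicted": 2},
    {"n_human": 0, "n_matched": 0, "n_predicted": 0},
]
print(summarize(toy_pairs))
```

In the toy data two of three defective screens get at least one hit, yet only two of ten known issues are recovered, which is the same shape of gap the results describe: a single match is enough to count a screen, while dense screens contribute many issues that each need their own match.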

The takeaway: current VLM endpoints are genuinely useful as recall-oriented review aids — and nowhere near a drop-in replacement for a human reviewer. Localization, fine-grained comparison (especially typography), and calibrated restraint are the open problems.

Get the data