Can VLMs Review Mobile UI? Introducing Scry Design Diff Eval

July 3, 2026

Generated UI is everywhere, and someone has to check it. When a coding agent implements a screen from a design reference, the question is not "do these two images differ?" — they always differ — but "what changed, where, and does a developer need to care?" Pixel-diff tools drown you in harmless rendering noise while missing a swapped icon or a wrong selected state. Vision-language models should be good at this. We built a benchmark to find out whether they are.

The benchmark

Scry Design Diff Eval contains 311 human-reviewed mobile UI pairs: a reference screenshot and a generated implementation of the same screen. Of these, 234 pairs carry 557 human-tagged defects, each recorded as a defect tag (Icon/Nav, Typography, Missing Content, ...) plus a hand-drawn selection box. The remaining 77 pairs have no tagged issues and serve as controls. A model must return a structured issue list: tags and normalized boxes, not prose.
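
To make "structured issue list" concrete, here is a minimal sketch of what parsing and validating such output might look like. The JSON field names (`issues`, `tags`, `box`) and the `(x_min, y_min, x_max, y_max)` box order are assumptions for illustration, not the benchmark's published schema.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Issue:
    # Hypothetical shape: one or more defect tags plus one normalized box.
    tags: Tuple[str, ...]
    box: Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) in [0, 1]


def parse_issues(raw: str) -> List[Issue]:
    """Parse a model response into issues, rejecting malformed entries."""
    data = json.loads(raw)
    issues = []
    for item in data["issues"]:
        x0, y0, x1, y1 = (float(v) for v in item["box"])
        # A normalized box must lie inside the image and have positive area.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box is not a valid normalized rectangle: {item['box']}")
        tags = tuple(item["tags"])
        if not tags:
            raise ValueError("every issue needs at least one tag")
        issues.append(Issue(tags=tags, box=(x0, y0, x1, y1)))
    return issues


example = '{"issues": [{"tags": ["Typography"], "box": [0.1, 0.2, 0.4, 0.3]}]}'
parsed = parse_issues(example)
```

Rejecting out-of-range boxes at parse time keeps malformed output from silently scoring as a miss later on.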

Scoring is deterministic and recall-first. A model issue counts only if it shares a tag with a human issue and its box overlaps that issue's box (IoU ≥ 0.10) on the same image. Matching is one-to-one: each human issue can be claimed by at most one model issue, and vice versa. Human annotations are known positives, not exhaustive ground truth, so extra model findings are reported as diagnostics rather than penalties.
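
The matching rule can be sketched as follows. The post specifies the conditions (shared tag, same image, IoU ≥ 0.10, one-to-one) but not how ties are resolved, so the greedy highest-IoU-first assignment here is an assumption, as are the `Issue` fields and the `image` labels.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max), normalized


@dataclass(frozen=True)
class Issue:
    tags: FrozenSet[str]
    image: str  # hypothetical label, e.g. "reference" or "generated"
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_issues(human: List[Issue], model: List[Issue], min_iou: float = 0.10):
    """Return (human_index, model_index) pairs, each index used at most once."""
    candidates = []
    for hi, h in enumerate(human):
        for mi, m in enumerate(model):
            # Eligible only with a shared tag on the same image.
            if h.image != m.image or not (h.tags & m.tags):
                continue
            score = iou(h.box, m.box)
            if score >= min_iou:
                candidates.append((score, hi, mi))
    # Assumed tie-break: take the best overlaps first, then lowest indices.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_h, used_m, matches = set(), set(), []
    for _, hi, mi in candidates:
        if hi in used_h or mi in used_m:
            continue
        used_h.add(hi)
        used_m.add(mi)
        matches.append((hi, mi))
    return matches


human = [Issue(frozenset({"Typography"}), "generated", (0.1, 0.1, 0.5, 0.5))]
model = [
    Issue(frozenset({"Typography"}), "generated", (0.2, 0.2, 0.6, 0.6)),  # overlaps, same tag
    Issue(frozenset({"Icon/Nav"}), "generated", (0.1, 0.1, 0.5, 0.5)),    # overlaps, wrong tag
]
matches = match_issues(human, model)
known_issue_recall = len(matches) / len(human)
```

In this toy case the wrong-tag prediction stays unmatched. Under the recall-first rule it lowers diagnostic precision but is not treated as a confirmed error.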

Results

We evaluated six model endpoints on the full 311-pair set:

Model	Known-Issue Recall	Diagnostic Precision	Issue F1
Kimi K2.7 Code + Together recovery	38.2%	15.9%	22.5%
Gemini 3.5 Flash	37.5%	20.2%	26.3%
Codex GPT-5.5 xhigh	37.3%	13.6%	19.9%
MiniMax M3	21.9%	11.5%	15.0%
Gemma 4 26B A4B	20.3%	12.4%	15.4%
Gemma 4 31B	17.8%	13.3%	15.2%
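
The three columns are related by simple arithmetic. The sketch below assumes Issue F1 is the usual harmonic mean of recall and precision, and that diagnostic precision is matched issues divided by total predictions. The rounded table values are consistent with the first assumption. The counts it derives are rough estimates from rounded percentages, not figures from the paper.

```python
TOTAL_KNOWN_ISSUES = 557  # human-tagged defects in the benchmark


def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (assumed definition of Issue F1)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def implied_counts(recall: float, precision: float, total: int = TOTAL_KNOWN_ISSUES):
    """Estimate matched issues and total predictions from rounded rates."""
    matched = round(recall * total)
    predicted = round(matched / precision)  # assumes precision = matched / predicted
    return matched, predicted


# Gemini 3.5 Flash row from the table: recall 37.5%, precision 20.2%.
gemini_f1 = round(f1(0.375, 0.202) * 100, 1)
gemini_matched, gemini_predicted = implied_counts(0.375, 0.202)
```

Running the same estimate for each leader shows why near-identical recall can come with very different prediction volumes.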

Four things stood out:

The best models catch roughly two fifths of known issues. The three leaders sit within five matched issues of each other — but Gemini 3.5 Flash gets there with hundreds fewer predictions, giving it the best precision and F1 of the high-recall group.
Noticing a diff is easy; enumerating defects is hard. The leaders find something on 62–66% of defective screens, but issue-level recall is ~38% — on dense screens most annotated defects go unrecovered.
Defect families differ wildly. Changed illustrations (Image/Asset, 76.7%) and Missing Content (69.1%) are the easiest; Typography tops out at 20.0%.
Abstention is basically absent. The high-recall models flag 96–100% of the control pairs that have no tagged issues. A review tool that flags nearly every screen still needs a human triage pass.
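
The screen-level and issue-level numbers above measure different things, and a small sketch may help keep them apart. It assumes "finding something" on a defective screen means at least one matched known issue, and that a control is "flagged" when the model predicts any issue at all. The `PairResult` record and the toy values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairResult:
    n_human: int      # human-tagged issues on this pair (0 for a control)
    n_predicted: int  # issues the model returned
    n_matched: int    # predictions matched to a human issue


def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def summarize(results: List[PairResult]) -> Dict[str, float]:
    defective = [r for r in results if r.n_human > 0]
    controls = [r for r in results if r.n_human == 0]
    return {
        # Share of defective screens with at least one matched issue.
        "screen_hit_rate": _ratio(sum(r.n_matched > 0 for r in defective), len(defective)),
        # Share of all known issues that were matched.
        "issue_recall": _ratio(sum(r.n_matched for r in defective),
                               sum(r.n_human for r in defective)),
        # Share of control screens where the model reported anything.
        "control_flag_rate": _ratio(sum(r.n_predicted > 0 for r in controls), len(controls)),
    }


toy = [
    PairResult(n_human=5, n_predicted=6, n_matched=1),  # dense screen, one defect recovered
    PairResult(n_human=1, n_predicted=2, n_matched=1),
    PairResult(n_human=2, n_predicted=3, n_matched=0),
    PairResult(n_human=0, n_predicted=2, n_matched=0),  # control, flagged anyway
    PairResult(n_human=0, n_predicted=0, n_matched=0),  # control, correctly left alone
]
summary = summarize(toy)
```

In the toy data, two of three defective screens get a hit while only a quarter of known issues are recovered. That is the same gap between screen-level and issue-level performance described above.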

The takeaway: current VLM endpoints are genuinely useful as recall-oriented review aids — and nowhere near a drop-in replacement for a human reviewer. Localization, fine-grained comparison (especially typography), and calibrated restraint are the open problems.

Get the data

Full paper: blog.scrymore.com/paper
Dataset (311 pairs, images + annotations): huggingface.co/datasets/epinnock/scry-design-diff-eval
Interactive results: pilot dashboard and unmatched-prediction audit