Can VLMs Review Mobile UI? Introducing Scry Design Diff Eval
July 3, 2026
Generated UI is everywhere, and someone has to check it. When a coding agent implements a screen from a design reference, the question is not "do these two images differ?" — they always differ — but "what changed, where, and does a developer need to care?" Pixel-diff tools drown you in harmless rendering noise while missing a swapped icon or a wrong selected state. Vision-language models should be good at this. We built a benchmark to find out whether they are.
The benchmark
Scry Design Diff Eval contains 311 human-reviewed mobile UI pairs: a reference screenshot and a generated implementation of the same screen. 234 pairs carry 557 human-tagged defects — each one a defect tag (Icon/Nav, Typography, Missing Content, ...) plus a hand-drawn selection box — and 77 pairs with no tagged issues serve as controls. A model must return a structured issue list: tags and normalized boxes, not prose.
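To make "tags and normalized boxes, not prose" concrete, here is a minimal sketch of what one structured issue could look like. The field names, the `(x0, y0, x1, y1)` box convention, and the `"reference"` / `"generated"` image labels are illustrative assumptions, not the benchmark's published schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Issue:
    """One reported defect. Field names here are hypothetical."""
    tag: str                                  # e.g. "Typography", "Icon/Nav"
    image: str                                # assumed: "reference" or "generated"
    box: tuple[float, float, float, float]    # assumed: (x0, y0, x1, y1) in [0, 1]

    def __post_init__(self):
        # Reject boxes that are not normalized or have no area.
        x0, y0, x1, y1 = self.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"box must be normalized with x0 < x1, y0 < y1: {self.box}")


# A model response for one pair is then just a list of issues.
issues = [
    Issue("Typography", "generated", (0.08, 0.12, 0.92, 0.18)),
    Issue("Icon/Nav", "generated", (0.00, 0.92, 1.00, 1.00)),
]
payload = json.dumps([asdict(issue) for issue in issues], indent=2)
print(payload)
```

Validating boxes at construction time keeps malformed model output from reaching the scorer, where a zero-area box would otherwise just silently fail to match.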
Scoring is deterministic and recall-first. A model issue counts only if it shares a tag with a human issue and its box overlaps that issue's box on the same image with IoU ≥ 0.10; matches are one-to-one, so each human issue can be credited at most once. Human annotations are known positives, not exhaustive ground truth, so extra model findings are reported as diagnostics rather than penalties.
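The matching rule can be sketched in a few lines. The post does not say how one-to-one assignment is resolved, so this version assumes a greedy pass in descending IoU order; the dict keys and the `(x0, y0, x1, y1)` box layout are likewise assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_issues(human, model, threshold=0.10):
    """Return (human_index, model_index) pairs, each index used at most once."""
    candidates = []
    for hi, h in enumerate(human):
        for mi, m in enumerate(model):
            # Same tag and same image are required before overlap is considered.
            if h["tag"] == m["tag"] and h["image"] == m["image"]:
                score = iou(h["box"], m["box"])
                if score >= threshold:
                    candidates.append((score, hi, mi))
    # Greedy assignment (an assumption): best overlaps claim their partners first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used_h, used_m, pairs = set(), set(), []
    for _, hi, mi in candidates:
        if hi not in used_h and mi not in used_m:
            used_h.add(hi)
            used_m.add(mi)
            pairs.append((hi, mi))
    return pairs


human = [
    {"tag": "Typography", "image": "generated", "box": (0.1, 0.1, 0.5, 0.2)},
    {"tag": "Icon/Nav", "image": "generated", "box": (0.0, 0.9, 1.0, 1.0)},
]
model = [
    {"tag": "Typography", "image": "generated", "box": (0.1, 0.1, 0.5, 0.3)},
    {"tag": "Typography", "image": "generated", "box": (0.1, 0.1, 0.4, 0.2)},
    {"tag": "Missing Content", "image": "generated", "box": (0.0, 0.9, 1.0, 1.0)},
]
pairs = match_issues(human, model)
recall = len(pairs) / len(human)
print(pairs, recall)
```

In this toy case both Typography predictions overlap the same human box, but only the tighter one is credited, and the third prediction has a perfect box with the wrong tag, so it does not count. An optimal assignment (for example the Hungarian algorithm) would be a reasonable alternative to the greedy pass and can differ on crowded screens.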
Results
We evaluated six model endpoints on the full 311-pair set:
| Model | Known-Issue Recall | Diagnostic Precision | Issue F1 |
|---|---|---|---|
| Kimi K2.7 Code + Together recovery | 38.2% | 15.9% | 22.5% |
| Gemini 3.5 Flash | 37.5% | 20.2% | 26.3% |
| Codex GPT-5.5 xhigh | 37.3% | 13.6% | 19.9% |
| MiniMax M3 | 21.9% | 11.5% | 15.0% |
| Gemma 4 26B A4B | 20.3% | 12.4% | 15.4% |
| Gemma 4 31B | 17.8% | 13.3% | 15.2% |
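The Issue F1 column appears consistent with the usual harmonic mean of the other two columns, F1 = 2PR / (P + R), with Diagnostic Precision standing in for P. A quick check, using only the rounded percentages from the table (so small rounding differences are expected):

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (same units in, same units out)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall %, diagnostic precision %, reported F1 %) copied from the table above.
rows = {
    "Kimi K2.7 Code + Together recovery": (38.2, 15.9, 22.5),
    "Gemini 3.5 Flash": (37.5, 20.2, 26.3),
    "Codex GPT-5.5 xhigh": (37.3, 13.6, 19.9),
    "MiniMax M3": (21.9, 11.5, 15.0),
    "Gemma 4 26B A4B": (20.3, 12.4, 15.4),
    "Gemma 4 31B": (17.8, 13.3, 15.2),
}

diffs = {}
for name, (recall, precision, reported) in rows.items():
    computed = f1(recall, precision)
    diffs[name] = abs(computed - reported)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.1f}%")
```

Every row lands within 0.1 percentage points of the reported value, which is what you would expect if the published F1 was computed from unrounded counts.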
Four things stood out:
- The best models catch roughly two fifths of known issues. The three leaders sit within five matched issues of each other — but Gemini 3.5 Flash gets there with hundreds fewer predictions, giving it the best precision and F1 of the high-recall group.
- Noticing a diff is easy; enumerating defects is hard. The leaders find something on 62–66% of defective screens, but issue-level recall is ~38% — on dense screens most annotated defects go unrecovered.
- Defect families differ wildly. Changed illustrations (Image/Asset, 76.7%) and Missing Content (69.1%) are the easiest; Typography tops out at 20.0%.
- Abstention is basically absent. The high-recall models flag 96–100% of the no-issue controls. A review tool that flags every screen still needs a human triage pass.
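The gap between "found something on the screen" and "recovered the annotated defects" is easier to see with numbers. The sketch below uses made-up per-pair counts, not benchmark data, and assumes that "finding something" means at least one matched issue on a defective screen and that "flagging" a control means emitting at least one prediction.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    human: int      # human-tagged issues on this pair (0 for a control)
    matched: int    # human issues recovered by the model
    predicted: int  # total issues the model reported


def summarize(results):
    defective = [r for r in results if r.human > 0]
    controls = [r for r in results if r.human == 0]
    return {
        # Share of defective screens with at least one recovered issue.
        "screen_hit_rate": sum(r.matched > 0 for r in defective) / len(defective),
        # Share of all human-tagged issues recovered.
        "issue_recall": sum(r.matched for r in defective) / sum(r.human for r in defective),
        # Share of controls on which the model reported anything at all.
        "control_flag_rate": sum(r.predicted > 0 for r in controls) / len(controls),
    }


# Toy data for illustration only.
results = [
    PairResult(human=4, matched=1, predicted=6),  # dense screen, one hit
    PairResult(human=1, matched=1, predicted=3),
    PairResult(human=3, matched=0, predicted=2),
    PairResult(human=0, matched=0, predicted=2),  # control, flagged anyway
    PairResult(human=0, matched=0, predicted=0),  # control, correctly silent
]
summary = summarize(results)
print(summary)
```

Here two of three defective screens get a hit, yet only 2 of 8 annotated issues are recovered: a single match on a dense screen counts fully at the screen level while leaving most of its defects unfound.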
The takeaway: current VLM endpoints are genuinely useful as recall-oriented review aids — and nowhere near a drop-in replacement for a human reviewer. Localization, fine-grained comparison (especially typography), and calibrated restraint are the open problems.
Get the data
- Full paper: blog.scrymore.com/paper
- Dataset (311 pairs, images + annotations): huggingface.co/datasets/epinnock/scry-design-diff-eval
- Interactive results: pilot dashboard and unmatched-prediction audit