Scry Design Diff Eval: Measuring VLMs as Mobile UI Regression Reviewers

Scry Design Diff Eval is a benchmark built to evaluate vision-language models on the task of mobile UI diff review.

Each example pairs a reference screen with a generated implementation screen and uses human-drawn selection boxes plus explicit defect tags.

The current release contains a full 311-pair benchmark.

Abstract

Vision-language models are increasingly used to inspect generated interfaces, but practical UI review is not just image similarity. A useful reviewer must identify developer-actionable defects: missing components, wrong icons, typographic drift, incorrect state styling, and layout regressions.

We introduce Scry Design Diff Eval, a benchmark built from human-reviewed mobile UI pairs. Each example contains a reference screenshot and a generated implementation screenshot. Human annotations contain explicit defect tags and normalized selection boxes on one or both images. The primary metric is known-issue recall: the fraction of human-tagged issues recovered by a model with both a matching tag and overlapping selection box.

On the full 311-pair benchmark with 557 tagged issues, the highest-recall runs recover roughly two fifths of known issues. Kimi K2.7 Code with Together retry recovery obtains 38.2% known-issue recall, Gemini 3.5 Flash obtains 37.5%, and Codex GPT-5.5 xhigh obtains 37.3%. Gemini 3.5 Flash is the strongest balanced result in this snapshot, with 20.2% diagnostic precision and the best issue F1 among high-recall models. These results suggest that current VLM endpoints can identify many mobile UI defects, but still miss most known human issues and over-flag most no-diff controls under a strict tag-and-box matching protocol.

1. Introduction

Generated UI workflows increasingly produce screenshots that must be compared against product references or design targets. A reviewer needs to do more than notice that two images differ. They need to say what changed, where it changed, and whether the change matters to a developer.

Traditional visual regression testing emphasizes pixels or perceptual image distance. Those tools are valuable, but they are brittle for product review: harmless rendering differences can create large pixel deltas, while serious regressions can be localized and semantically subtle. A button can have the wrong selected state, an icon can be swapped for a similar asset, or a missing row can break the screen without dominating a global image metric.

Vision-language models appear well suited to this setting because they can jointly inspect images and produce structured text.

The benchmark asks a model to compare a reference mobile screenshot with a generated implementation screenshot and return a list of UI defects with tags and selection boxes. The scoring protocol is recall-first because the human annotations are known positives, not exhaustive negatives. Extra model findings may be valid and are reported diagnostically rather than used to reduce the primary score.

The main contributions are:

  • A mobile UI image-diff benchmark built from human-reviewed generated screens.
  • A structured annotation format with explicit defect tags and normalized boxes.
  • A deterministic tag-and-box judge for known-issue recall.

3. Dataset Construction

The source annotation tool presents reviewers with two images:

  • Image A: the reference target, usually a mobile app screenshot.
  • Image B: the generated implementation screenshot.

Reviewers draw a selection box on Image A, Image B, or both. They then attach one or more defect tags. Notes are supported by the annotation tool, but notes are missing on many annotations and are not used for primary v1 scoring.

The current local snapshot starts from completed reviewed screens. The dataset builder keeps only annotations that have explicit tags and at least one usable selection box. Screens with no scored tagged issues become no-tagged controls. This means no-tagged controls are a practical proxy, not a guarantee that the screen is defect-free.

QuantityCount
Eval pairs311
Visual-diff pairs234
No-tagged controls77
Counted tagged issues557

The dataset uses issue-density buckets:

SplitDefinitionFull-set pairs
single_issue1 tagged issue107
multi_issue2-3 tagged issues80
dense_issue4 or more tagged issues47
no_diff0 scored tagged issues77

These buckets are not manually curated difficulty labels. They measure annotation density. A future version could add a separate curated difficulty field.

The most common full-set tags are Icon/Nav, Color/Background, Spacing/Layout, Shape/Size, Missing Content, and Typography. The tag distribution is intentionally not balanced because it reflects the actual annotation workflow.

4. Task Definition

Given a reference screenshot and a generated screenshot, a model must return a JSON object with an issues array. Each issue describes one visible UI defect.

{
  "issues": [
    {
      "labels": ["Icon/Nav"],
      "note": "The bottom navigation icon differs from the reference.",
      "box_a": {"x": 0.10, "y": 0.90, "w": 0.12, "h": 0.07},
      "box_b": {"x": 0.10, "y": 0.90, "w": 0.12, "h": 0.07},
      "confidence": 0.80
    }
  ]
}

Boxes are normalized to the image size, with x, y, w, and h in the range 0 to 1. A model may supply a box on either image or both images. The scorer only uses boxes that are present and valid.

The desired output is not a paragraph describing the screen. It is a structured issue list that can be compared against known human annotations and inspected by a developer.

5. Scoring Protocol

The primary score is known-issue recall. A model issue matches a human issue when both conditions hold:

  • The model issue and human issue share at least one explicit defect tag.
  • The model box overlaps the human box on the same image side with IoU >= 0.10.

Matching is one-to-one: each human issue can be matched once, and each model issue can be matched once. This prevents a single broad prediction from recovering multiple annotations unless it is represented as multiple issues.

MetricDefinitionRole
Known-Issue RecallMatched human issues / all human issuesPrimary
Diff Screen RecallVisual-diff screens with at least one matched issue / all visual-diff screensDiagnostic
No-Tagged Flag RateNo-tagged screens where the model predicts at least one issue / all no-tagged screensDiagnostic
Diagnostic PrecisionMatched model issues / all model issuesDiagnostic
Unmatched Predicted IssuesModel issues that do not match a known human issueDiagnostic

The protocol is recall-first because human annotations are not exhaustive. Extra model findings can be valid, especially on no-tagged controls. They are surfaced for review rather than directly penalizing the primary metric.

The deterministic judge is intentionally simple and auditable. It does not use a separate VLM to decide whether an extra finding is valid.

6. Experimental Setup

The study began with open or openly available VLM endpoints and was later expanded with two hosted high-recall comparators. The 60-row pilot is stratified as 15 examples per bucket: no-tagged controls, single-issue screens, multi-issue screens, and dense-issue screens. The full benchmark contains 311 pairs and is evaluated with the same deterministic tag-and-box judge.

ModelProviderStatus
minimax/minimax-m3OpenRouter60-row pilot complete; full 311-row run complete
google/gemma-4-31b-itOpenRouter60-row pilot complete; full 311-row run complete
google/gemma-4-26b-a4b-itOpenRouter60-row pilot complete; full 311-row run complete
moonshotai/kimi-k2.7-codeOpenRouter + Together retry recovery60-row pilot complete; full 311-row run complete after retrying failed OpenRouter rows
gemini-3.5-flashGoogle Gemini API, OpenAI-compatible endpointFull 311-row run complete
codex-cli/gpt-5.5 xhighCodex CLIFull 311-row run complete

Three static controls validate the pipeline:

  • always-no-issue: predicts no issues.
  • always-generic-issue: predicts one vague issue for every pair.
  • oracle-notes: copies ground-truth tags and boxes.

The oracle control should score 100%, confirming that the tag-and-box judge can recover exact known annotations. The always-generic control should not recover issues, confirming that vague screen-level reporting is insufficient.

Kimi K2.7 Code requires special reporting. After OpenRouter-side retry passes, the OpenRouter full artifact still contained only 169 valid rows, 129 empty outputs, and 13 malformed outputs. We retried the 142 failed rows through Together AI using the same model ID and merged the valid retry outputs back into the full prediction file. The final Kimi artifact therefore measures the model with recovery, not the raw OpenRouter route alone.

7. Results

7.1 Static Controls

ModelKnown-Issue RecallDiff Screen RecallNo-Tagged Flag RateDiagnostic Precision
always-no-issue0.0%0.0%0.0%0.0%
always-generic-issue0.0%0.0%100.0%0.0%
oracle-notes100.0%100.0%0.0%100.0%

7.2 60-Row Hosted Pilot

The 60-row pilot contains 45 visual-diff screens, 15 no-tagged controls, and 140 scored human tagged issues. Coverage counts input rows with a prediction row present, including empty or malformed model outputs.

ModelCoverageKnown-Issue RecallDiff Screen RecallNo-Tagged Flag RateDiagnostic Precision
MiniMax M360/6016.4% (23/140)40.0% (18/45)86.7% (13/15)10.8% (23/212)
Gemma 4 31B60/6011.4% (16/140)28.9% (13/45)100.0% (15/15)11.0% (16/146)
Gemma 4 26B A4B60/6015.7% (22/140)31.1% (14/45)100.0% (15/15)12.1% (22/182)
Kimi K2.7 Code60/6023.6% (33/140)48.9% (22/45)60.0% (9/15)19.6% (33/168)

Kimi K2.7 Code has the highest scored known-issue recall in the pilot after retrying failed rows, while MiniMax M3 leads among the fully parseable hosted baselines. Gemma 4 31B and Gemma 4 26B A4B flag all 15 no-tagged controls, which shows that they are recall-leaning and noisy under the current no-tagged proxy.

Kimi K2.7 Code should be interpreted primarily as a route and format-stability result. After the 600-second retry pass, 42 of 60 rows returned valid JSON under the tested route; 16 rows were empty and 2 rows were invalid JSON.

7.3 Full 311-Row Results

The full benchmark contains 311 pairs: 234 visual-diff screens, 77 no-tagged controls, and 557 scored human issues. All full-run prediction artifacts are fully parseable except MiniMax M3, which has one malformed output row out of 311.

ModelCoverageKnown-Issue RecallDiff Screen RecallNo-Tagged Flag RateDiagnostic PrecisionIssue F1
Kimi K2.7 Code + Together recovery311/31138.2% (213/557)65.8% (154/234)100.0% (77/77)15.9% (213/1,340)22.5%
Gemini 3.5 Flash311/31137.5% (209/557)62.4% (146/234)96.1% (74/77)20.2% (209/1,033)26.3%
Codex GPT-5.5 xhigh311/31137.3% (208/557)64.5% (151/234)98.7% (76/77)13.6% (208/1,530)19.9%
MiniMax M3311/31121.9% (122/557)40.6% (95/234)93.5% (72/77)11.5% (122/1,065)15.0%
Gemma 4 26B A4B311/31120.3% (113/557)35.9% (84/234)85.7% (66/77)12.4% (113/913)15.4%
Gemma 4 31B311/31117.8% (99/557)32.5% (76/234)79.2% (61/77)13.3% (99/744)15.2%

Kimi K2.7 Code with recovery has the highest known-issue recall, but it flags every no-tagged control. Gemini 3.5 Flash is the strongest balanced result in this snapshot: it is only four matched issues behind Kimi, while producing 307 fewer predicted issues and the highest F1 among the high-recall models. Codex GPT-5.5 xhigh is effectively tied with the two leaders on recall, but is the most verbose high-recall run.

7.4 Defect-Density Breakdown

ModelSingle-Issue RecallMulti-Issue RecallDense-Issue Recall
Kimi K2.7 Code + Together recovery53.3%46.3%26.2%
Gemini 3.5 Flash49.5%40.0%30.8%
Codex GPT-5.5 xhigh50.5%47.9%24.2%
MiniMax M328.0%26.8%15.8%
Gemma 4 26B A4B22.4%22.6%17.7%
Gemma 4 31B21.5%20.0%14.6%

Dense screens separate the leading models differently from the aggregate score. Gemini 3.5 Flash recovers 80 of 260 dense-screen issues, ahead of Kimi's 68 and Codex's 63. Kimi and Codex remain stronger on single- and multi-issue screens.

7.5 Category Breakdown

Defect TagBest ModelBest Recall
Color/BackgroundKimi K2.7 Code + Together recovery40.8% (69/169)
Icon/NavKimi K2.7 Code + Together recovery40.4% (69/171)
Image/AssetGemini 3.5 Flash76.7% (23/30)
Missing ContentGemini 3.5 Flash69.1% (67/97)
Shape/SizeCodex GPT-5.5 xhigh51.9% (56/108)
Spacing/LayoutCodex GPT-5.5 xhigh62.7% (69/110)
TypographyKimi K2.7 Code + Together recovery20.0% (18/90)
SeparatorCodex GPT-5.5 xhigh35.0% (7/20)
Extra ElementGemini 3.5 Flash60.0% (3/5)
Selected/Disabled StateKimi K2.7 Code + Together recovery and Gemini 3.5 Flash66.7% (4/6)

Two rare tags with fewer than five scored annotations (other, 4 issues, and badimage, 1 issue) are omitted from this table.

Generated Study Tables

These tables are rendered from the current dataset, score, and prediction artifacts so the web paper can be checked against the executable study state.

Dataset Snapshot

Rows
311
Visual-diff pairs
234
No-tagged controls
77
Tagged issues
557

Sources: data/idf_eval.jsonl and reports/pilot/idf_pilot_15x4.jsonl

dense_issue 47multi_issue 80no_diff 77single_issue 107

60-Row Hosted Pilot

ModelCoverageKnown-Issue RecallDiff Screen RecallNo-Tagged Flag RateDiagnostic PrecisionUnmatched Predictions
MiniMax M3results/minimax-m3-pilot/scores.json60/6016.4% (23/140)40.0% (18/45)86.7% (13/15)10.8% (23/212)189
Gemma 4 31Bresults/gemma-4-31b-pilot/scores.json60/6011.4% (16/140)28.9% (13/45)100.0% (15/15)11.0% (16/146)130
Gemma 4 26B A4Bresults/gemma-4-26b-a4b-pilot/scores.json60/6015.7% (22/140)31.1% (14/45)100.0% (15/15)12.1% (22/182)160
Kimi K2.7 Coderesults/kimi-k2.7-code-pilot/scores.json60/6023.6% (33/140)48.9% (22/45)60.0% (9/15)19.6% (33/168)135

Completed Full Runs

ModelCoverageKnown-Issue RecallDiff Screen RecallNo-Tagged Flag RateDiagnostic PrecisionUnmatched Predictions
Kimi K2.7 Code + Together recoveryresults/kimi-k2.7-code-full-combined-20260628-retryfix2-together-20260702/scores.json311/31138.2% (213/557)65.8% (154/234)100.0% (77/77)15.9% (213/1340)1127
Gemini 3.5 Flashresults/gemini-3.5-flash-full-20260703/scores.json311/31137.5% (209/557)62.4% (146/234)96.1% (74/77)20.2% (209/1033)824
Codex GPT-5.5 xhighresults/codex-cli-gpt-5.5-xhigh-full-20260701/scores.json311/31137.3% (208/557)64.5% (151/234)98.7% (76/77)13.6% (208/1530)1322
MiniMax M3results/minimax-m3-full-combined-20260628/scores.json311/31121.9% (122/557)40.6% (95/234)93.5% (72/77)11.5% (122/1065)943
Gemma 4 26B A4Bresults/gemma-4-26b-a4b-full-combined-20260628/scores.json311/31120.3% (113/557)35.9% (84/234)85.7% (66/77)12.4% (113/913)800
Gemma 4 31Bresults/gemma-4-31b-full-combined-20260628/scores.json311/31117.8% (99/557)32.5% (76/234)79.2% (61/77)13.3% (99/744)645

Prediction Format Health

ModelRowsValid JSONEmptyMalformed
Kimi K2.7 Code + Together recoveryresults/kimi-k2.7-code-full-combined-20260628-retryfix2-together-20260702/predictions.jsonl31131100
Gemini 3.5 Flashresults/gemini-3.5-flash-full-20260703/predictions.jsonl31131100
Codex GPT-5.5 xhighresults/codex-cli-gpt-5.5-xhigh-full-20260701/predictions.jsonl31131100
MiniMax M3results/minimax-m3-full-combined-20260628/predictions.jsonl31131001
Gemma 4 26B A4Bresults/gemma-4-26b-a4b-full-combined-20260628/predictions.jsonl31131100
Gemma 4 31Bresults/gemma-4-31b-full-combined-20260628/predictions.jsonl31131100

Unmatched Prediction Audit

These rows are diagnostic review queues. They are not primary-score false positives under the recall-first protocol.

ModelAudit RowsCSV Artifact
MiniMax M3189reports/pilot/minimax-m3-15x4-unmatched-predictions.csv
Gemma 4 31B130reports/pilot/gemma-4-31b-15x4-unmatched-predictions.csv
Gemma 4 26B A4B160reports/pilot/gemma-4-26b-a4b-15x4-unmatched-predictions.csv
Kimi K2.7 Code135reports/pilot/kimi-k2.7-code-15x4-unmatched-predictions.csv

Visual Examples

These figures mirror the annotation interface: reference and generated screenshots are shown side by side, with an issue-level model result matrix below.

Localized dense-screen exampleVivid ios Dec 2024 (1) · vivid__184 · dense_issue

A dense visual-diff example where several models recover at least one known issue, but issue-level recall remains incomplete.

Image A
vivid__184 image a1234
Image B
vivid__184 image b1234
Human issueKimi K2.7 CodeGemini 3.5 FlashCodex GPT-5.5MiniMax M3Gemma 4 26B A4BGemma 4 31B
Issue #1#835
Icon/NavShape/SizeImage/Asset

No note; scored from tag and box.

FoundIcon/Nav, Shape/SizeImage A IoU 0.77, pred #3FoundIcon/Nav, Shape/SizeImage A IoU 0.88, pred #1Not found5 screen predictionsFoundIcon/Nav, Shape/SizeImage B IoU 0.40, pred #3FoundIcon/Nav, Shape/SizeImage A IoU 0.79, pred #1FoundImage/AssetImage B IoU 0.78, pred #1
Issue #2#836
Shape/SizeImage/Asset

No note; scored from tag and box.

Not found5 screen predictionsFoundShape/SizeImage A IoU 0.84, pred #2Not found5 screen predictionsNot found4 screen predictionsFoundShape/SizeImage A IoU 0.73, pred #2FoundImage/AssetImage B IoU 0.75, pred #2
Issue #3#837
Shape/SizeIcon/NavColor/Background

No note; scored from tag and box.

Not found5 screen predictionsFoundIcon/Nav, Shape/SizeImage A IoU 0.84, pred #3FoundIcon/Nav, Shape/SizeImage A IoU 0.21, pred #3Not found4 screen predictionsFoundIcon/Nav, Shape/SizeImage A IoU 0.73, pred #3Not found4 screen predictions
Issue #4#838
Shape/Size

No note; scored from tag and box.

FoundShape/SizeImage A IoU 0.71, pred #5Not found3 screen predictionsFoundShape/SizeImage A IoU 0.88, pred #5FoundShape/SizeImage A IoU 0.67, pred #4Not found3 screen predictionsNot found4 screen predictions
No-tagged control exampleAmazon Shopping ios Aug 2025 · amazon-shopping__801 · no_diff

A no-tagged control used to inspect over-flagging behavior. Some model findings may still be valid, so these are diagnostics rather than primary-score penalties.

Image A
amazon-shopping__801 image a
Image B
amazon-shopping__801 image b
Human issueKimi K2.7 CodeGemini 3.5 FlashCodex GPT-5.5MiniMax M3Gemma 4 26B A4BGemma 4 31B
No scored human issue

No tagged human issue is available for this control screen.

Flagged 5No scored human issue in this rowFlagged 3No scored human issue in this rowFlagged 5No scored human issue in this rowNo issue predictedMatched no-diff controlFlagged 4No scored human issue in this rowFlagged 3No scored human issue in this row

8. Analysis

The full benchmark shows four recurring patterns.

First, screen-level detection is easier than issue-level recall. A model can notice that a screen differs while still missing most annotated defects on that screen. This is especially visible on dense examples, where recovering one issue counts for screen recall but leaves many known issues unmatched. The leading models reach 62-66% diff-screen recall, but only 37-38% known-issue recall.

Second, defect families vary in difficulty. The strongest category results are on Image/Asset and Missing Content, where Gemini 3.5 Flash recovers 76.7% and 69.1% of known issues. Frequent subtle categories remain harder: Typography tops out at 20.0%, Icon/Nav at 40.4%, and Color/Background at 40.8%. This is consistent with the visual nature of the task: missing content and changed illustrations often occupy larger, more semantically distinct regions, while typography, color, and icon differences can be small and localized.

Third, abstention is difficult. The no-tagged controls are not guaranteed clean, but they still reveal a practical behavior: every high-recall model reports issues on nearly every no-tagged control. Kimi flags 77 of 77 no-tagged controls, Codex flags 76 of 77, and Gemini flags 74 of 77. Under the recall-first protocol, this does not reduce known-issue recall, but it matters for product usability because a review tool that flags every screen requires substantial human triage.

Fourth, precision and recall trade off differently across models. Kimi K2.7 Code with Together recovery has the highest known-issue recall, but Gemini 3.5 Flash produces fewer unmatched predictions and has the best F1. Codex is nearly tied on recall but produces the largest prediction set, lowering diagnostic precision. The Gemma runs are more conservative and have better no-diff specificity, but they miss substantially more known issues.

The Kimi K2.7 Code run also adds a serving lesson: response format and route stability are part of benchmark performance. Even after OpenRouter-side retry passes, the OpenRouter full artifact had 142 failed rows. Retrying those rows through Together AI yielded a clean 311-row prediction file and changed Kimi from an output-health failure case into the highest-recall run. This should be reported as model-plus-route recovery, not as a clean single-endpoint result.

9. Limitations

The human annotations are known positives, not exhaustive ground truth. A model can identify a real issue that a human annotator missed. For this reason, unmatched predicted issues are diagnostics rather than primary false positives.

No-tagged controls are inferred from completed screens with zero scored tagged issues. They should not be treated as guaranteed defect-free. A future release should include a manual audit pass for no-tagged controls before making strong specificity claims.

The current split names measure issue density, not visual difficulty. A single-issue screen can still be hard if the issue is subtle, and a dense screen can be easy if all issues are large and obvious.

The multi-model comparison now covers the full 311-row set for MiniMax M3, Gemma 4 31B, Gemma 4 26B A4B, Kimi K2.7 Code with Together recovery, Gemini 3.5 Flash, and Codex GPT-5.5 xhigh. These runs are not all identical deployment conditions: Kimi combines OpenRouter and Together retry recovery, Gemini uses Google's OpenAI-compatible endpoint, and Codex uses the Codex CLI. The benchmark therefore compares practical eval runs, not isolated model weights under a single serving stack.

The current scoring is intentionally recall-first. Diagnostic precision is computed against known human annotations, but unmatched model predictions may include real defects missed by annotators. Precision and no-diff specificity should therefore be treated as operational diagnostics until unmatched findings and no-tagged controls receive a separate human audit.

10. Reproducibility and Release

The benchmark is represented as JSONL. Each row contains pair metadata, paths to Image A and Image B, the density split, and human ground-truth issues with tags and normalized boxes. Prediction files are also JSONL, with one model output per input row. Artifact filenames retain the benchmark's historical idf prefix, for example data/idf_eval.jsonl.

The core reproducibility flow is:

python scripts/build_dataset.py --output data/idf_eval.jsonl
python scripts/make_pilot_dataset.py \
  --dataset data/idf_eval.jsonl \
  --output reports/pilot/idf_pilot_15x4.jsonl \
  --manifest reports/pilot/idf_pilot_15x4.manifest.json \
  --per-bucket 15
python scripts/evaluate.py \
  --dataset reports/pilot/idf_pilot_15x4.jsonl \
  --predictions results/<model>-pilot/predictions.jsonl \
  --judge-provider tag_box \
  --judged-output results/<model>-pilot/judged.jsonl \
  --output results/<model>-pilot/scores.json

For the current study snapshot, the repository also emits a machine-readable manifest that records dataset summaries, score summaries, audit queue sizes, paper sections, core documentation, reproducibility scripts, byte sizes, and SHA-256 hashes. The manifest can be regenerated and checked with:

python scripts/make_study_manifest.py \
  --output-json reports/study_manifest.json \
  --output-md docs/study_manifest.md

python scripts/check_study_manifest.py \
  --manifest reports/study_manifest.json

The current manifest validates the local study artifacts, including the 311-row full dataset, the 60-row pilot dataset, four hosted pilot score sets, the six completed full runs, unmatched-prediction audit CSVs, output-health reports, paper markdown sections, and reproducibility scripts.

Hand-written report tables are checked against the score JSON files with:

python scripts/check_report_numbers.py

The unmatched-prediction audit is a secondary human-review layer. The web audit page can export and re-import decision JSON, and the exported decisions can be summarized with:

python scripts/summarize_audit_decisions.py \
  --decisions path/to/idf-unmatched-audit-decisions.json \
  --output-json reports/pilot/unmatched-audit-decisions.json \
  --output-md reports/pilot/unmatched-audit-decisions.md

The release package should include dataset JSONL, copied image assets or stable image references, score scripts, prompt templates, and a small example submission. A public release should also document the no-tagged control caveat and the recall-first interpretation of unmatched model predictions. Unmatched prediction CSVs should be treated as human audit queues rather than primary false-positive labels.

11. Conclusion

Scry Design Diff Eval evaluates VLMs as localized mobile UI regression reviewers. The task is intentionally stricter than captioning or broad screen-level comparison: a model must output a defect list with matching tags and boxes.

The current results show that VLM endpoints can recover many known UI defects, but still miss most human-tagged issues under deterministic tag-and-box scoring. On the full 311-pair benchmark, Kimi K2.7 Code with Together recovery reaches 38.2% known-issue recall, Gemini 3.5 Flash reaches 37.5%, and Codex GPT-5.5 xhigh reaches 37.3%. Gemini is the strongest balanced run in this snapshot because it keeps recall near the top while producing fewer unmatched predictions and the highest F1.

These numbers leave substantial room for improvement in fine-grained comparison, localization, structured output reliability, and calibrated abstention. The most recall-capable models still flag nearly every no-tagged control, which is acceptable for a recall-first benchmark but costly for a production review workflow.

The benchmark is most useful when read as a recall-first review aid evaluation: it measures how many known human issues a model catches, while preserving extra model findings for separate human analysis.

References

  • Linhao Zhang, Aiwei Liu, Yuan Liu, and Xiao Zhou. "DiffSpot: Can VLMs Spot

Fine-Grained Visual Differences in Web Interfaces?" arXiv:2605.29615, 2026.

  • Scry Design Diff Eval repository artifacts used by this draft:

docs/study_results.md, docs/openrouter_15x4_results.md, docs/study_manifest.md, reports/study_manifest.json, reports/pilot/idf_pilot_15x4.jsonl, results/*-pilot/scores.json, scripts/make_study_manifest.py, scripts/check_study_manifest.py, scripts/check_report_numbers.py, and scripts/summarize_audit_decisions.py.