
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Y. Ruan, Zhifeng Zhang, Zhonghu Wang, Zi-Jian Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
arXiv.org | 2025
ArtifactsBench introduces a multimodal, automated evaluation framework for LLM-generated visual and interactive code artifacts, using programmatic rendering and MLLM-as-Judge scoring to assess visual fidelity and interactive integrity beyond algorithmic correctness.

Problem Statement

Existing code generation benchmarks (e.g., HumanEval, MBPP) evaluate only algorithmic correctness and are blind to visual quality and interactivity—properties that define modern web and UI artifacts. As LLMs increasingly generate dynamic, interactive front-end code, there is no scalable, automated way to measure whether the output actually looks and behaves correctly from a user's perspective. This gap means developers lack reliable signals to compare and improve models for real-world, user-centric generative tasks.

Key Novelty

  • First benchmark specifically designed for evaluating visual fidelity and interactive integrity of LLM-generated code artifacts, with 1,825 diverse tasks covering dynamic and interactive web content
  • A programmatic render-and-capture pipeline that executes generated code in a browser, takes temporal screenshots to capture dynamic behavior, and feeds this visual evidence to an MLLM-as-Judge with fine-grained per-task checklists
  • Achieves 94.4% ranking consistency with WebDev Arena (human preference gold standard) and >90% pairwise agreement with human experts, establishing the first scalable automated proxy for human-perceived quality in visual code generation

Evaluation Highlights

  • 94.4% ranking consistency with WebDev Arena across 30+ leading LLMs, validating the framework as a reliable automated proxy for human preference
  • Over 90% pairwise agreement with human expert evaluations, demonstrating high reliability and reproducibility of the MLLM-as-Judge scoring paradigm
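
The two numbers above are agreement statistics between rankings produced by ArtifactsBench and rankings produced by humans. As a purely illustrative sketch, pairwise ranking consistency between an automated ranking and a human-preference ranking can be computed as the fraction of concordant model pairs; the model names and scores below are made up, and the paper's exact metric definition may differ.

```python
from itertools import combinations

def pairwise_ranking_consistency(auto_scores: dict, human_scores: dict) -> float:
    """Fraction of model pairs that the two score dictionaries order the same way.

    Both dictionaries map model name -> scalar score (higher is better). This is
    one common reading of "ranking consistency"; the paper may use a different
    formulation (e.g., a rank-correlation coefficient).
    """
    models = sorted(set(auto_scores) & set(human_scores))
    concordant, total = 0, 0
    for a, b in combinations(models, 2):
        auto_diff = auto_scores[a] - auto_scores[b]
        human_diff = human_scores[a] - human_scores[b]
        if auto_diff == 0 or human_diff == 0:
            continue  # skip tied pairs
        total += 1
        concordant += (auto_diff > 0) == (human_diff > 0)
    return concordant / total if total else float("nan")

# Hypothetical scores (benchmark-style 0-100 scores vs. arena-style ratings):
auto = {"model_a": 72.1, "model_b": 65.4, "model_c": 58.9}
human = {"model_a": 1210, "model_b": 1185, "model_c": 1120}
print(pairwise_ranking_consistency(auto, human))  # -> 1.0 (all pairs ordered alike)
```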

Breakthrough Assessment

7/10. ArtifactsBench fills a genuine and practically important evaluation gap, combining visual-interactive quality assessment with automated scalability and achieving near-human agreement. However, it is primarily a benchmarking and evaluation contribution rather than a new modeling or training advance, which makes it a significant contribution rather than a paradigm shift.

Methodology

  1. Curate 1,825 diverse visual code generation tasks spanning static layouts, animations, games, dashboards, and interactive UI components, each paired with a fine-grained evaluation checklist specifying correctness criteria
  2. Programmatically render each LLM-generated artifact in a sandboxed browser environment and capture temporal screenshots at multiple time steps to record dynamic and interactive behavior (a minimal sketch of this step follows the list)
  3. Pass the source code, temporal screenshots, and per-task checklist to a Multimodal LLM (MLLM)-as-Judge, which scores each artifact holistically on visual fidelity, functional correctness, and interactive integrity to produce reproducible rankings
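
Step 2 is the piece that conventional text-only harnesses lack. Below is a minimal render-and-capture sketch using Playwright; the library choice, viewport, and screenshot schedule are assumptions, and the paper's harness may additionally drive interactions (clicks, typing) before capturing frames.

```python
from pathlib import Path
from playwright.sync_api import sync_playwright  # pip install playwright && playwright install chromium

def render_and_capture(html_path: str, out_dir: str, offsets_ms=(0, 1000, 3000)) -> list[str]:
    """Render a generated HTML artifact headlessly and screenshot it at several
    time offsets so animations and state changes are visible to the judge."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shots: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless Chromium as the rendering sandbox
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(Path(html_path).resolve().as_uri())
        elapsed = 0
        for t in offsets_ms:
            page.wait_for_timeout(t - elapsed)  # let dynamic behavior play out
            elapsed = t
            shot = out / f"frame_{t:05d}ms.png"
            page.screenshot(path=str(shot), full_page=True)
            shots.append(str(shot))
        browser.close()
    return shots
```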

System Components

Task Benchmark (1,825 tasks)

A curated set of diverse visual code generation prompts covering static, animated, and interactive web artifacts with associated fine-grained evaluation checklists
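
As an illustrative sketch, one benchmark entry might be represented as below; the field names and example checklist are assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class ArtifactTask:
    """One visual code generation task plus its fine-grained scoring checklist."""
    task_id: str
    category: str                  # e.g., "static layout", "animation", "game", "dashboard"
    prompt: str                    # instruction given to the LLM under evaluation
    checklist: list[str] = field(default_factory=list)  # per-task correctness criteria

example = ArtifactTask(
    task_id="demo-0001",
    category="interactive UI",
    prompt="Build a to-do list page where items can be added, checked off, and deleted.",
    checklist=[
        "An input field and an 'Add' button are visible",
        "Clicking 'Add' appends the typed text as a new list item",
        "Checking an item visually marks it as completed",
        "A delete control removes the corresponding item",
    ],
)
```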

Render-and-Capture Pipeline

Programmatically executes generated code in a browser sandbox and takes temporal screenshots to document both static appearance and dynamic/interactive behavior over time

MLLM-as-Judge

A multimodal large language model evaluator guided by per-task checklists that scores artifacts based on visual screenshots and source code for holistic, reproducible quality assessment
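
A minimal sketch of assembling the judge input (task prompt, checklist, source code, and temporal screenshots) for an OpenAI-compatible multimodal endpoint; the judge model, prompt wording, and scoring rubric used in the paper are not reproduced here, and the details below are assumptions.

```python
import base64
from openai import OpenAI  # any OpenAI-compatible multimodal endpoint

def judge_artifact(client: OpenAI, model: str, task_prompt: str,
                   checklist: list[str], source_code: str,
                   screenshot_paths: list[str]) -> str:
    """Ask a multimodal judge to score one artifact against its checklist."""
    rubric = "\n".join(f"- {item}" for item in checklist)
    text = (
        "You are grading an LLM-generated web artifact.\n"
        f"Task: {task_prompt}\n\nChecklist:\n{rubric}\n\n"
        f"Source code:\n{source_code}\n\n"
        "The screenshots were captured at increasing time offsets. For each "
        "checklist item, state pass/fail with a short justification, then give "
        "an overall 0-10 score."
    )
    content = [{"type": "text", "text": text}]
    for path in screenshot_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,
    )
    return resp.choices[0].message.content
```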

Evaluation Harness

End-to-end open-source framework that orchestrates code generation, rendering, screenshot capture, and MLLM scoring, enabling scalable evaluation of any LLM
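
Tying the components together, a minimal per-model evaluation loop might look like the sketch below. It reuses the hypothetical ArtifactTask, render_and_capture, and judge_artifact helpers sketched earlier and assumes they are in scope; the released open-source harness is the authoritative implementation.

```python
from pathlib import Path

def evaluate_model(tasks, generate_fn, judge_client, judge_model, workdir="runs"):
    """Generate, render, and judge every benchmark task for one model under test.

    generate_fn(prompt) -> source string (e.g., a self-contained HTML page)
    produced by the LLM being evaluated.
    """
    results = []
    for task in tasks:
        task_dir = Path(workdir) / task.task_id
        task_dir.mkdir(parents=True, exist_ok=True)

        source = generate_fn(task.prompt)                           # 1. generation
        html_path = task_dir / "artifact.html"
        html_path.write_text(source, encoding="utf-8")

        shots = render_and_capture(str(html_path), str(task_dir))   # 2. render + capture
        verdict = judge_artifact(judge_client, judge_model,         # 3. MLLM-as-Judge
                                 task.prompt, task.checklist, source, shots)
        results.append({"task_id": task.task_id, "verdict": verdict})
    return results
```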

Results

| Metric | Prior Automated Benchmarks | ArtifactsBench | Delta |
| --- | --- | --- | --- |
| Ranking consistency with WebDev Arena | Not reported / N/A | 94.4% | +significant |
| Pairwise agreement with human experts | Not reported / N/A | >90% | +significant |
| Number of LLMs evaluated | Varies (~10-20 typical) | 30+ | +~10-20 models |
| Task coverage (visual/interactive) | Primarily algorithmic | 1,825 visual/interactive tasks | New capability |

Key Takeaways

  • ML practitioners building or comparing LLMs for front-end/UI code generation can now use ArtifactsBench as a scalable, human-correlated automated evaluation tool instead of relying on expensive human studies or misaligned algorithmic benchmarks
  • The MLLM-as-Judge paradigm with fine-grained per-task checklists and temporal screenshot evidence is a reusable design pattern applicable to other domains requiring visual or behavioral quality assessment beyond text correctness
  • Generalist frontier models frequently outperform domain-specific coding models on visual artifact quality, suggesting that broad multimodal training may be more valuable than narrow code specialization for user-facing generative applications

Abstract

The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
