ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Problem Statement
Existing code generation benchmarks (e.g., HumanEval, MBPP) evaluate only algorithmic correctness and are blind to visual quality and interactivity—properties that define modern web and UI artifacts. As LLMs increasingly generate dynamic, interactive front-end code, there is no scalable, automated way to measure whether the output actually looks and behaves correctly from a user's perspective. This gap means developers lack reliable signals to compare and improve models for real-world, user-centric generative tasks.
Key Novelty
- First benchmark specifically designed for evaluating visual fidelity and interactive integrity of LLM-generated code artifacts, with 1,825 diverse tasks covering dynamic and interactive web content
- A programmatic render-and-capture pipeline that executes generated code in a browser, takes temporal screenshots to capture dynamic behavior, and feeds this visual evidence to an MLLM-as-Judge with fine-grained per-task checklists (a minimal capture sketch follows this list)
- Achieves 94.4% ranking consistency with WebDev Arena (human preference gold standard) and >90% pairwise agreement with human experts, establishing the first scalable automated proxy for human-perceived quality in visual code generation
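To make the render-and-capture step concrete, here is a minimal sketch of temporal screenshot capture. It assumes Playwright as the browser automation layer and uses an illustrative timing schedule; the paper does not prescribe this exact implementation.

```python
# Minimal render-and-capture sketch using Playwright (an assumption; the paper
# does not mandate a specific browser automation library). Captures a series of
# timed screenshots so animations and interactive state changes leave visual evidence.
from pathlib import Path
from playwright.sync_api import sync_playwright


def capture_temporal_screenshots(html_path: str, out_dir: str,
                                 waits_ms: tuple[int, ...] = (0, 1000, 3000)) -> list[str]:
    """Render a generated HTML artifact and screenshot it after successive waits."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shots: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()          # headless sandboxed browser
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(Path(html_path).resolve().as_uri())
        for i, wait in enumerate(waits_ms):
            page.wait_for_timeout(wait)        # let animations / scripts progress
            path = out / f"frame_{i}.png"
            page.screenshot(path=str(path), full_page=True)
            shots.append(str(path))
        browser.close()
    return shots
```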
Evaluation Highlights
- 94.4% ranking consistency with WebDev Arena across 30+ leading LLMs, validating the framework as a reliable automated proxy for human preference
- Over 90% pairwise agreement with human expert evaluations, demonstrating high reliability and reproducibility of the MLLM-as-Judge scoring paradigm (see the agreement sketch after this list)
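The paper reports these headline numbers without restating the underlying formula here; one common way to compute pairwise ranking agreement between two orderings of the same models is sketched below. The function and example data are illustrative and may not match ArtifactsBench's exact metric.

```python
# Illustrative pairwise ranking agreement between two rankings of the same models.
# This is one common formulation; the exact metric used by ArtifactsBench may differ.
from itertools import combinations


def pairwise_agreement(ranking_a: list[str], ranking_b: list[str]) -> float:
    """Fraction of model pairs ordered the same way by both rankings."""
    pos_a = {m: i for i, m in enumerate(ranking_a)}
    pos_b = {m: i for i, m in enumerate(ranking_b)}
    pairs = list(combinations(pos_a, 2))
    agree = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return agree / len(pairs)


# Example: identical top-2 order, swapped bottom pair -> 5/6 agreement (~0.83).
print(pairwise_agreement(["m1", "m2", "m3", "m4"], ["m1", "m2", "m4", "m3"]))
```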
Methodology
- Curate 1,825 diverse visual code generation tasks spanning static layouts, animations, games, dashboards, and interactive UI components, each paired with a fine-grained evaluation checklist specifying correctness criteria
- Programmatically render each LLM-generated artifact in a sandboxed browser environment and capture temporal screenshots at multiple time steps to record dynamic and interactive behavior
- Pass the source code, temporal screenshots, and per-task checklist to a Multimodal LLM (MLLM)-as-Judge, which scores each artifact holistically on visual fidelity, functional correctness, and interactive integrity to produce reproducible rankings (see the judging sketch after this list)
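As a rough illustration of the judging step, the sketch below bundles the source code, temporal screenshots, and checklist into one multimodal request. It uses an OpenAI-style chat API purely as a stand-in judge; the model name, prompt wording, and 0-10 scale are assumptions rather than the paper's actual protocol.

```python
# Sketch of the MLLM-as-Judge step: bundle source code, temporal screenshots,
# and the per-task checklist into one multimodal prompt. An OpenAI-style chat
# API is used only as a stand-in judge; model name, prompt wording, and the
# 0-10 scale are assumptions, not ArtifactsBench's exact protocol.
import base64
from openai import OpenAI

client = OpenAI()


def judge_artifact(source_code: str, screenshot_paths: list[str],
                   checklist: list[str], model: str = "gpt-4o") -> str:
    content = [{
        "type": "text",
        "text": (
            "You are judging an LLM-generated web artifact.\n"
            "Checklist:\n" + "\n".join(f"- {item}" for item in checklist) +
            "\n\nSource code:\n" + source_code + "\n\n"
            "Score visual fidelity, functional correctness, and interactive "
            "integrity from 0 to 10 each, citing checklist items as evidence."
        ),
    }]
    for path in screenshot_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```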
System Components
- Task suite: A curated set of diverse visual code generation prompts covering static, animated, and interactive web artifacts, each with an associated fine-grained evaluation checklist
- Render-and-capture pipeline: Programmatically executes generated code in a browser sandbox and takes temporal screenshots to document both static appearance and dynamic/interactive behavior over time
- MLLM-as-Judge: A multimodal large language model evaluator guided by per-task checklists that scores artifacts from visual screenshots and source code for holistic, reproducible quality assessment
- Evaluation harness: End-to-end open-source framework that orchestrates code generation, rendering, screenshot capture, and MLLM scoring, enabling scalable evaluation of any LLM (an orchestration sketch follows)
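A minimal orchestration loop tying these components together might look like the following. The helper functions (generate_code, capture_temporal_screenshots, judge_artifact) refer to the hypothetical sketches above, not to ArtifactsBench's actual API.

```python
# End-to-end orchestration sketch tying the components together. The helpers
# capture_temporal_screenshots and judge_artifact are the hypothetical pieces
# sketched earlier; generate_code is any callable wrapping the model under test.
from pathlib import Path


def run_benchmark(tasks: list[dict], generate_code, workdir: str = "runs") -> list[dict]:
    """For each task: generate an artifact, render and capture it, then judge it."""
    results = []
    for task in tasks:  # task: {"id": ..., "prompt": ..., "checklist": [...]}
        html = generate_code(task["prompt"])            # model under evaluation
        task_dir = Path(workdir) / str(task["id"])
        task_dir.mkdir(parents=True, exist_ok=True)
        html_path = task_dir / "artifact.html"
        html_path.write_text(html, encoding="utf-8")
        shots = capture_temporal_screenshots(str(html_path), str(task_dir))
        verdict = judge_artifact(html, shots, task["checklist"])
        results.append({"task_id": task["id"], "verdict": verdict})
    return results
```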
Results
| Metric | Prior Automated Benchmarks | ArtifactsBench | Delta |
|---|---|---|---|
| Ranking consistency with WebDev Arena | Not reported | 94.4% | First automated proxy reported |
| Pairwise agreement with human experts | Not reported | >90% | First automated proxy reported |
| Number of LLMs evaluated | Varies (~10-20 typical) | 30+ | +10-20 models |
| Task coverage (visual/interactive) | Primarily algorithmic | 1,825 visual/interactive tasks | New capability |
Key Takeaways
- ML practitioners building or comparing LLMs for front-end/UI code generation can now use ArtifactsBench as a scalable, human-correlated automated evaluation tool instead of relying on expensive human studies or misaligned algorithmic benchmarks
- The MLLM-as-Judge paradigm with fine-grained per-task checklists and temporal screenshot evidence is a reusable design pattern applicable to other domains requiring visual or behavioral quality assessment beyond text correctness (see the checklist example after this list)
- Generalist frontier models frequently outperform domain-specific coding models on visual artifact quality, suggesting that broad multimodal training may be more valuable than narrow code specialization for user-facing generative applications
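For readers considering reusing the checklist-driven pattern, a per-task entry might be structured roughly as below; this is a hypothetical shape for illustration, not the benchmark's actual schema.

```python
# Hypothetical shape of a per-task entry with its fine-grained checklist;
# the actual ArtifactsBench schema may differ.
task_example = {
    "id": "dashboard-042",
    "prompt": "Build a sales dashboard with a sortable table and an animated bar chart.",
    "checklist": [
        "The bar chart animates on load and reflects the provided data.",
        "Clicking a column header sorts the table by that column.",
        "Layout remains usable at a 1280x720 viewport without overflow.",
    ],
}
```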
Abstract
The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.