
WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

Mingde Xu, Zhen Yang, Wenyi Hong, Lihang Pan, Xinyue Fan, Yan Wang, Xiaotao Gu, Bin Xu, Jie Tang
arXiv.org | 2025
WebVIA is the first agentic framework that goes beyond static UI-to-Code generation by capturing multi-state UI interactions, generating executable interactive HTML/CSS/JavaScript, and validating the resulting interactivity automatically.

Problem Statement

Current VLM-based UI-to-Code systems only generate static layouts, failing to capture interactive behaviors like button clicks, dropdowns, and state transitions that are essential for functional UI development. This gap makes AI-generated code insufficient for real-world deployment and forces developers to manually add interactivity. There is also no established pipeline for verifying that generated UI code behaves correctly across multiple states.

Key Novelty

  • First agentic framework specifically designed for interactive (multi-state) UI-to-Code generation, moving beyond single-screenshot static HTML/CSS generation
  • A specialized UI exploration agent that captures multi-state UI screenshots more stably and accurately than general-purpose agents like Gemini-2.5-Pro
  • A validation module that automatically verifies the interactivity of generated code, enabling a closed-loop generation-and-verification pipeline

Evaluation Highlights

  • WebVIA-Agent achieves more stable and accurate UI exploration compared to general-purpose agents such as Gemini-2.5-Pro on multi-state UI capture tasks
  • Fine-tuned WebVIA-UI2Code models outperform their base model counterparts on both interactive UI2Code benchmarks and static UI2Code benchmarks

Breakthrough Assessment

6/10. WebVIA addresses a real and underexplored gap—interactive UI code generation—with a well-structured three-component pipeline, but it is a solid engineering and fine-tuning contribution rather than a fundamental algorithmic advance; the approach builds on existing VLMs and agentic scaffolding rather than introducing new learning paradigms.

Methodology

  1. UI Exploration: A specialized exploration agent navigates web UIs to capture multi-state screenshots representing different interactive states (e.g., hover, click, dropdown open), building a richer visual specification than a single screenshot
  2. Code Generation: A fine-tuned UI2Code model takes the multi-state visual inputs and generates executable HTML/CSS/JavaScript code that includes interactive behaviors, not just static layouts
  3. Validation: A validation module renders the generated code and verifies that the specified interactive behaviors (state transitions, element responses) are correctly implemented, providing feedback for correction if needed
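The three steps above form a closed loop: generation feeds validation, and validation feedback drives regeneration. A minimal sketch of that control flow is shown below; the function names (`generate_code`, `validate_interactivity`) and the trivial string check are illustrative stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of WebVIA's generate-validate loop. Both helpers are
# stand-ins: the real system calls a fine-tuned VLM and a browser-based
# validation module.

def generate_code(screenshots, feedback=None):
    # Stand-in for the UI2Code model: would consume multi-state screenshots
    # (plus validator feedback on retries) and emit HTML/CSS/JavaScript.
    return '<button onclick="toggleMenu()">Menu</button>'

def validate_interactivity(code):
    # Stand-in for the validation module: would render `code` and replay
    # interactions. Here: a trivial static check for an event handler.
    ok = "onclick" in code
    feedback = None if ok else "missing event handler on interactive element"
    return ok, feedback

def ui_to_code(screenshots, max_rounds=3):
    """Closed-loop pipeline: generate, validate, retry with feedback."""
    feedback = None
    for _ in range(max_rounds):
        code = generate_code(screenshots, feedback)
        ok, feedback = validate_interactivity(code)
        if ok:
            return code
    raise RuntimeError(f"validation failed after retries: {feedback}")
```

The retry budget (`max_rounds`) bounds how many correction rounds the validator can trigger before the pipeline gives up.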

System Components

Exploration Agent

A web-based agent that systematically interacts with UI mockups or live pages to capture screenshots across multiple interaction states, providing richer ground-truth visual context than single-image approaches
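At an abstract level, this exploration can be viewed as a graph traversal: UI states are nodes and interactions (clicks, hovers) are edges. The sketch below illustrates that idea with a hand-written transition map; a real agent would instead drive a live browser and diff screenshots to discover states, so the map and state names here are purely hypothetical.

```python
# Abstract sketch of multi-state UI exploration as breadth-first traversal.
# TRANSITIONS is a hypothetical stand-in for what a browser-driving agent
# would discover at runtime: state -> {action: next_state}.
from collections import deque

TRANSITIONS = {
    "home":       {"click:menu": "menu_open", "hover:card": "card_hover"},
    "menu_open":  {"click:item": "detail", "click:menu": "home"},
    "card_hover": {},
    "detail":     {},
}

def explore(start):
    """Visit every reachable UI state exactly once, breadth-first."""
    captured, seen, queue = [], {start}, deque([start])
    while queue:
        state = queue.popleft()
        captured.append(state)  # a real agent would take a screenshot here
        for _action, nxt in TRANSITIONS.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return captured
```

The `seen` set prevents the agent from looping on cyclic interactions such as toggling a menu open and closed.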

UI2Code Model (WebVIA-UI2Code)

A fine-tuned Vision-Language Model that processes multi-state UI screenshots and generates executable, interactive HTML/CSS/JavaScript code including event handlers and dynamic behaviors
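Feeding several state screenshots to one model call implies packing them into a single multimodal request. The sketch below shows one plausible shape for such a request, mimicking common VLM chat-message schemas; the field names and structure are assumptions for illustration, not WebVIA's actual interface.

```python
# Hypothetical multi-state request builder for a UI2Code model. The message
# schema imitates typical VLM chat APIs; it is NOT WebVIA's real format.
import base64

def build_ui2code_request(screenshots, instruction):
    """Pack one instruction plus N labeled state screenshots into a message.

    screenshots: list of (state_label, raw_image_bytes) pairs.
    """
    content = [{"type": "text", "text": instruction}]
    for label, raw in screenshots:
        content.append({
            "type": "image",
            "data": base64.b64encode(raw).decode("ascii"),
            "label": label,  # e.g. "default", "dropdown_open", "hover"
        })
    return {"role": "user", "content": content}
```

Labeling each image with its interaction state lets the model associate visual differences with the transitions the generated JavaScript must implement.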

Validation Module

An automated verification component that renders the generated code in a browser environment and checks whether the interactive behaviors match the intended UI states, closing the generation loop
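The paper's validator renders the code in a browser and exercises it; as a much weaker stand-alone approximation, one can at least statically check that interactive elements carry handlers at all. The stdlib-only sketch below does exactly that and is offered as a simplified illustration, not the paper's method.

```python
# Simplified static stand-in for the validation module: flag interactive
# elements that have no inline event handler or href. The real module runs
# the code in a browser and verifies actual state transitions.
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"button", "select", "a", "input"}

class HandlerCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            names = {name for name, _value in attrs}
            if not any(n.startswith("on") or n == "href" for n in names):
                self.missing.append(tag)

def check_handlers(html):
    """Return the interactive tags lacking any handler, empty if all pass."""
    parser = HandlerCheck()
    parser.feed(html)
    return parser.missing
```

A check like this can only catch dead markup; verifying that a click actually produces the expected next state requires the browser-in-the-loop validation the paper describes.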

Results

| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Multi-state UI exploration stability | Gemini-2.5-Pro (general-purpose agent) | WebVIA-Agent | More stable and accurate |
| Interactive UI2Code benchmark | Base VLM (pre-fine-tuning) | WebVIA-UI2Code (fine-tuned) | Substantial improvement |
| Static UI2Code benchmark | Base VLM (pre-fine-tuning) | WebVIA-UI2Code (fine-tuned) | Improved performance |
| Code executability | Base VLM outputs | WebVIA-UI2Code outputs | Higher executable code rate |

Key Takeaways

  • Domain-specific fine-tuning of VLMs on multi-state UI data significantly outperforms prompting general-purpose frontier models for interactive code generation tasks, suggesting specialized models are worth training for UI automation workflows
  • Automated validation/verification modules are critical for agentic code generation pipelines—without them, there is no reliable way to confirm that generated interactive behaviors are functionally correct
  • Multi-state visual context (capturing UI across different interaction states) is a key ingredient for generating interactive code; practitioners building UI automation tools should invest in richer data collection beyond single-screenshot inputs

Abstract

User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at [https://webvia.github.io](https://zheny2751-dotcom.github.io/webvia.github.io/).

Generated on 2026-03-02 using Claude