WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
Problem Statement
Current VLM-based UI-to-Code systems generate only static layouts, failing to capture interactive behaviors such as button clicks, dropdowns, and state transitions that are essential to functional UI development. This gap leaves AI-generated code insufficient for real-world deployment and forces developers to add interactivity by hand. There is also no established pipeline for verifying that generated UI code behaves correctly across multiple states.
Key Novelty
- First agentic framework specifically designed for interactive (multi-state) UI-to-Code generation, moving beyond single-screenshot static HTML/CSS generation
- A specialized UI exploration agent that captures multi-state UI screenshots more stably and accurately than general-purpose agents like Gemini-2.5-Pro
- A validation module that automatically verifies the interactivity of generated code, enabling a closed-loop generation-and-verification pipeline
Evaluation Highlights
- WebVIA-Agent achieves more stable and accurate UI exploration compared to general-purpose agents such as Gemini-2.5-Pro on multi-state UI capture tasks
- Fine-tuned WebVIA-UI2Code models outperform their base model counterparts on both interactive UI2Code benchmarks and static UI2Code benchmarks
Methodology
- Step 1 - UI Exploration: A specialized exploration agent navigates web UIs to capture multi-state screenshots representing different interactive states (e.g., hover, click, dropdown open), building a richer visual specification than a single screenshot
- Step 2 - Code Generation: A fine-tuned UI2Code model takes the multi-state visual inputs and generates executable HTML/CSS/JavaScript code that includes interactive behaviors, not just static layouts
- Step 3 - Validation: A validation module renders the generated code and verifies that the specified interactive behaviors (state transitions, element responses) are correctly implemented, providing feedback for correction if needed
System Components
- Exploration Agent: A web-based agent that systematically interacts with UI mockups or live pages to capture screenshots across multiple interaction states, providing richer ground-truth visual context than single-image approaches
- UI2Code Model: A fine-tuned Vision-Language Model that processes multi-state UI screenshots and generates executable, interactive HTML/CSS/JavaScript code including event handlers and dynamic behaviors
- Validation Module: An automated verification component that renders the generated code in a browser environment and checks whether the interactive behaviors match the intended UI states, closing the generation loop
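One way to frame what the validation module checks is as a state machine: the expected UI behavior maps (state, event) pairs to next states, and a trace of observed transitions from the rendered page is verified against it. This is a stdlib-only sketch under that assumption; the `ExpectedTransitions` structure and the state names are illustrative, not the paper's interface.

```python
# Hypothetical model of interactivity validation: expected UI behavior is
# a transition table (state, event) -> next_state, and the observed
# interaction trace from the rendered page is checked against it.

ExpectedTransitions = dict[tuple[str, str], str]

def check_trace(spec: ExpectedTransitions,
                start: str,
                trace: list[tuple[str, str]]) -> bool:
    """Verify every (event, observed_state) step against the spec."""
    current = start
    for event, observed in trace:
        expected = spec.get((current, event))
        if expected is None or expected != observed:
            return False  # unexpected event or wrong resulting state
        current = observed
    return True

# Expected behavior of a simple dropdown menu with two states.
spec: ExpectedTransitions = {
    ("closed", "click_toggle"): "open",
    ("open", "click_toggle"): "closed",
    ("open", "click_item"): "closed",
}

assert check_trace(spec, "closed",
                   [("click_toggle", "open"), ("click_item", "closed")])
assert not check_trace(spec, "closed", [("click_item", "open")])
```

In a real pipeline, the trace would come from driving the generated page in a headless browser; here it is supplied directly to keep the sketch self-contained.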
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Multi-state UI Exploration Stability | Gemini-2.5-Pro (general-purpose agent) | WebVIA-Agent | More stable and accurate |
| Interactive UI2Code Benchmark | Base VLM (pre-fine-tuning) | WebVIA-UI2Code (fine-tuned) | Substantial improvement |
| Static UI2Code Benchmark | Base VLM (pre-fine-tuning) | WebVIA-UI2Code (fine-tuned) | Improved performance |
| Code Executability | Base VLM outputs | WebVIA-UI2Code outputs | Higher executable code rate |
Key Takeaways
- Domain-specific fine-tuning of VLMs on multi-state UI data significantly outperforms prompting general-purpose frontier models for interactive code generation tasks, suggesting specialized models are worth training for UI automation workflows
- Automated validation/verification modules are critical for agentic code generation pipelines—without them, there is no reliable way to confirm that generated interactive behaviors are functionally correct
- Multi-state visual context (capturing UI across different interaction states) is a key ingredient for generating interactive code; practitioners building UI automation tools should invest in richer data collection beyond single-screenshot inputs
Abstract
User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at \href{https://zheny2751-dotcom.github.io/webvia.github.io/}{\texttt{https://webvia.github.io}}.