WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
Problem Statement
Current VLM-based UI-to-Code systems generate only static layouts, failing to capture interactive behaviors such as button clicks, dropdowns, and state transitions that are essential to functional UI development. This gap leaves AI-generated code insufficient for real-world deployment and forces developers to add interactivity by hand. There is also no established pipeline for verifying that generated UI code behaves correctly across multiple states.
Key Novelty
- First agentic framework specifically designed for interactive (multi-state) UI-to-Code generation, moving beyond single-screenshot static HTML/CSS generation
- A specialized UI exploration agent that captures multi-state UI screenshots more stably and accurately than general-purpose agents like Gemini-2.5-Pro
- A validation module that automatically verifies the interactivity of generated code, enabling a closed-loop generation-and-verification pipeline
Evaluation Highlights
- WebVIA-Agent achieves more stable and accurate UI exploration compared to general-purpose agents such as Gemini-2.5-Pro on multi-state UI capture tasks
- Fine-tuned WebVIA-UI2Code models outperform their base model counterparts on both interactive UI2Code benchmarks and static UI2Code benchmarks
Methodology
- Step 1 - UI Exploration: A specialized exploration agent navigates web UIs to capture multi-state screenshots representing different interactive states (e.g., hover, click, dropdown open), building a richer visual specification than a single screenshot
- Step 2 - Code Generation: A fine-tuned UI2Code model takes the multi-state visual inputs and generates executable HTML/CSS/JavaScript code that includes interactive behaviors, not just static layouts
- Step 3 - Validation: A validation module renders the generated code and verifies that the specified interactive behaviors (state transitions, element responses) are correctly implemented, providing feedback for correction if needed
System Components
- Exploration Agent: A web-based agent that systematically interacts with UI mockups or live pages to capture screenshots across multiple interaction states, providing richer ground-truth visual context than single-image approaches
- UI2Code Model: A fine-tuned Vision-Language Model that processes multi-state UI screenshots and generates executable, interactive HTML/CSS/JavaScript code including event handlers and dynamic behaviors
- Validation Module: An automated verification component that renders the generated code in a browser environment and checks whether the interactive behaviors match the intended UI states, closing the generation loop
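One way to frame what the validation module checks is as a state machine: the expected UI behavior maps (state, event) pairs to next states, and a trace of observed transitions from the rendered page is verified against it. This is a stdlib-only sketch under that assumption; the `ExpectedTransitions` structure and the state names are illustrative, not the paper's interface.

```python
# Hypothetical model of interactivity validation: expected UI behavior is
# a transition table (state, event) -> next_state, and the observed
# interaction trace from the rendered page is checked against it.

ExpectedTransitions = dict[tuple[str, str], str]

def check_trace(spec: ExpectedTransitions,
                start: str,
                trace: list[tuple[str, str]]) -> bool:
    """Verify every (event, observed_state) step against the spec."""
    current = start
    for event, observed in trace:
        expected = spec.get((current, event))
        if expected is None or expected != observed:
            return False  # unexpected event or wrong resulting state
        current = observed
    return True

# Expected behavior of a simple dropdown menu with two states.
spec: ExpectedTransitions = {
    ("closed", "click_toggle"): "open",
    ("open", "click_toggle"): "closed",
    ("open", "click_item"): "closed",
}

assert check_trace(spec, "closed",
                   [("click_toggle", "open"), ("click_item", "closed")])
assert not check_trace(spec, "closed", [("click_item", "open")])
```

In a real pipeline, the trace would come from driving the generated page in a headless browser; here it is supplied directly to keep the sketch self-contained.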
Results
| Metric/Benchmark | Baseline | This Paper | Delta |
|---|---|---|---|
| Multi-state UI Exploration Stability | Gemini-2.5-Pro (general-purpose agent) | WebVIA-Agent | More stable and accurate |
| Interactive UI2Code Benchmark | Base VLM (pre-fine-tuning) | WebVIA-UI2Code (fine-tuned) | Substantial improvement |
| Static UI2Code Benchmark | Base VLM (pre-fine-tuning) | WebVIA-UI2Code (fine-tuned) | Improved performance |
| Code Executability | Base VLM outputs | WebVIA-UI2Code outputs | Higher executable code rate |
Key Takeaways
- Domain-specific fine-tuning of VLMs on multi-state UI data significantly outperforms prompting general-purpose frontier models for interactive code generation tasks, suggesting specialized models are worth training for UI automation workflows
- Automated validation/verification modules are critical for agentic code generation pipelines—without them, there is no reliable way to confirm that generated interactive behaviors are functionally correct
- Multi-state visual context (capturing UI across different interaction states) is a key ingredient for generating interactive code; practitioners building UI automation tools should invest in richer data collection beyond single-screenshot inputs
Abstract
User interface (UI) development requires translating design mockups into functional code, a process that remains repetitive and labor-intensive. While recent Vision-Language Models (VLMs) automate UI-to-Code generation, they generate only static HTML/CSS/JavaScript layouts lacking interactivity. To address this, we propose WebVIA, the first agentic framework for interactive UI-to-Code generation and validation. The framework comprises three components: 1) an exploration agent to capture multi-state UI screenshots; 2) a UI2Code model that generates executable interactive code; 3) a validation module that verifies the interactivity. Experiments demonstrate that WebVIA-Agent achieves more stable and accurate UI exploration than general-purpose agents (e.g., Gemini-2.5-Pro). In addition, our fine-tuned WebVIA-UI2Code models exhibit substantial improvements in generating executable and interactive HTML/CSS/JavaScript code, outperforming their base counterparts across both interactive and static UI2Code benchmarks. Our code and models are available at \href{https://zheny2751-dotcom.github.io/webvia.github.io/}{\texttt{https://webvia.github.io}}.