UI-UG: A Unified MLLM for UI Understanding and Generation
Problem Statement
Existing MLLMs struggle with domain-specific UI tasks: they exhibit poor fine-grained understanding of complex modern UIs and generate low-quality UI outputs that fail to match human preferences. Prior approaches treat UI understanding and generation as separate problems, missing the synergistic benefits of joint training. Industrial deployment is further hindered by the lack of practical workflows covering DSL design, rendering, and evaluation.
Key Novelty
- Unified model combining UI understanding and generation in a single MLLM, demonstrating that joint training improves both tasks simultaneously
- Application of GRPO (reinforcement learning-based optimization) on top of SFT specifically for fine-grained UI understanding on complex modern UI data
- End-to-end industrially-oriented workflow including an LLM-friendly domain-specific language (DSL) design, rendering pipeline, and tailored evaluation metrics for UI generation quality
Evaluation Highlights
- Achieves SOTA on UI understanding benchmarks, outperforming both larger general-purpose MLLMs and similarly sized UI-specialized models
- UI generation quality is on par with larger MLLMs at a fraction of the computational cost, validated through DPO-aligned human preference metrics
Methodology
- Step 1 - Data & DSL Design: Curate complex modern UI datasets and design an LLM-friendly domain-specific language (DSL) to represent UI structures in a way amenable to language model training
- Step 2 - Understanding Training: Apply Supervised Fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) to enhance fine-grained UI comprehension, grounding, and element recognition
- Step 3 - Generation Alignment: Apply Direct Preference Optimization (DPO) to align UI generation outputs with human preferences, followed by a rendering pipeline that converts model outputs to visual UIs and evaluation using domain-specific metrics
System Components
Supervised fine-tuning on curated UI understanding and generation data to establish baseline domain knowledge in the MLLM
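The SFT stage can be summarized as standard token-level cross-entropy on curated (UI input, target) pairs. A minimal sketch, assuming the usual negative log-likelihood objective (the function name and the toy probabilities are illustrative, not from the paper):

```python
import math

def sft_loss(target_token_probs):
    """Mean negative log-likelihood of the gold target tokens
    under the model's predicted distribution."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

# Hypothetical probabilities the model assigned to each gold token
# of a DSL target sequence; lower probability => higher loss.
loss = sft_loss([0.9, 0.8, 0.95])
```

Perfect predictions (probability 1.0 on every gold token) drive the loss to zero, which is why SFT alone tends to plateau once the easy supervised signal is exhausted.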
Group Relative Policy Optimization applied after SFT to further refine fine-grained UI understanding through reinforcement learning signals
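The distinguishing step in GRPO is computing advantages relative to a group of sampled responses rather than a learned value function. A hedged sketch of that group-relative normalization (not the paper's exact implementation; the 0/1 rewards stand in for a task score such as grounding correctness):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std,
    so above-average samples get positive advantage and
    below-average samples get negative advantage."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one UI-grounding question,
# scored 1.0 (correct) or 0.0 (incorrect).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The normalized advantages then weight the policy-gradient update, rewarding responses that beat their own group's average.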
Direct Preference Optimization used for the generation task to ensure model outputs match human-preferred UI designs
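DPO trains directly on (chosen, rejected) preference pairs. A minimal sketch of the standard DPO objective (the canonical formulation, not necessarily the paper's exact setup): the loss is the negative log-sigmoid of the scaled margin between policy and reference log-probability ratios.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid of the beta-scaled margin
    between policy and reference log-ratios."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative log-probs: the policy favors the human-preferred UI
# more than the frozen reference model does, so the loss is below log(2).
loss = dpo_loss(-10.0, -12.0, ref_logp_chosen=-11.0, ref_logp_rejected=-11.0)
```

When the policy matches the reference exactly, the margin is zero and the loss equals log 2; the loss shrinks as the policy increasingly prefers the human-chosen UI.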
A domain-specific language designed to represent UI layouts and components in a format that is easily parsed and generated by language models
A process that converts the model's DSL output into rendered visual UI images for evaluation and deployment
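A rendering step of this kind can be a deterministic DSL-to-HTML transform. The JSON-like DSL below is hypothetical (the paper's actual DSL is not reproduced here); the sketch only illustrates how a structured model output can be rendered to a visual UI for evaluation or deployment:

```python
import json
from html import escape

def render(node):
    """Recursively turn a hypothetical DSL node into an HTML string."""
    tag = {"page": "div", "button": "button", "text": "span"}.get(node["type"], "div")
    children = "".join(render(c) for c in node.get("children", []))
    label = escape(node.get("label", ""))  # escape user-visible text
    return f'<{tag} class="{node["type"]}">{label}{children}</{tag}>'

# A toy DSL document, as the model might emit it.
dsl = json.loads('{"type": "page", "children": ['
                 '{"type": "text", "label": "Sign in"},'
                 '{"type": "button", "label": "Continue"}]}')
html = render(dsl)
```

Because the transform is deterministic, any rendering artifact can be attributed to the model's DSL output rather than the pipeline, which simplifies evaluation.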
Custom metrics tailored to assess both UI understanding accuracy and generation quality beyond generic MLLM benchmarks
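One common family of UI-understanding metrics is grounding accuracy: a predicted element box counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds a threshold. The paper's exact metrics may differ; this is a generic sketch:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predicted boxes whose IoU with the matching
    ground-truth box meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Toy example: one near-match and one complete miss.
acc = grounding_accuracy([(0, 0, 10, 10), (50, 50, 60, 60)],
                         [(1, 1, 10, 10), (0, 0, 5, 5)])
```

Generation quality, by contrast, typically needs render-then-compare or preference-based scoring, since there is no single correct output.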
Results
| Metric/Benchmark | Baseline (larger MLLMs / UI-specialized) | UI-UG | Delta |
|---|---|---|---|
| UI Understanding (fine-grained) | Larger general-purpose MLLMs | SOTA | Outperforms despite smaller size |
| UI Understanding (vs. specialists) | Similarly sized prior-SOTA UI specialists | SOTA | Outperforms similarly sized specialists |
| UI Generation Quality | Larger MLLMs | On par | Matched at a fraction of the compute cost |
| Unified Task Synergy | Separate understanding/generation models | Joint model improves both | Positive transfer between tasks |
Key Takeaways
- Joint training of understanding and generation in a single MLLM can improve both tasks, suggesting that practitioners should consider unified architectures rather than siloed models for related UI tasks
- GRPO applied after SFT is an effective strategy for fine-grained visual understanding in domain-specific settings, and ML engineers should consider RL-based fine-tuning when SFT alone plateaus on complex perception tasks
- Designing an LLM-friendly DSL is a critical practical step when adapting MLLMs to structured generation domains like UI; the choice of output representation significantly impacts both trainability and rendering fidelity
Abstract
Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they still face challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), which integrates both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on modern, complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly sized UI-specialized models. It is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We further demonstrate that integrating understanding and generation tasks improves accuracy and quality for both. Code and Model: https://github.com/neovateai/UI-UG