UI-UG: A Unified MLLM for UI Understanding and Generation
Problem Statement
Existing MLLMs struggle with domain-specific UI tasks: they exhibit poor fine-grained understanding of complex modern UIs and generate low-quality UI outputs that fail to match human preferences. Prior approaches treat UI understanding and generation as separate problems, missing the synergistic benefits of joint training. Industrial deployment is further hindered by the lack of practical workflows covering DSL design, rendering, and evaluation.
Key Novelty
- Unified model combining UI understanding and generation in a single MLLM, demonstrating that joint training improves both tasks simultaneously
- Application of GRPO (reinforcement learning-based optimization) on top of SFT specifically for fine-grained UI understanding on complex modern UI data
- End-to-end industrially-oriented workflow including an LLM-friendly domain-specific language (DSL) design, rendering pipeline, and tailored evaluation metrics for UI generation quality
Evaluation Highlights
- Achieves SOTA on UI understanding benchmarks, outperforming both larger general-purpose MLLMs and similarly sized UI-specialized models
- UI generation quality is on par with larger MLLMs at a fraction of the computational cost, validated through DPO-aligned human preference metrics
Methodology
- Step 1 - Data & DSL Design: Curate complex modern UI datasets and design an LLM-friendly domain-specific language (DSL) to represent UI structures in a way amenable to language model training
- Step 2 - Understanding Training: Apply Supervised Fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) to enhance fine-grained UI comprehension, grounding, and element recognition
- Step 3 - Generation Alignment: Apply Direct Preference Optimization (DPO) to align UI generation outputs with human preferences, followed by a rendering pipeline that converts model outputs to visual UIs and evaluation using domain-specific metrics
System Components
Supervised fine-tuning on curated UI understanding and generation data to establish baseline domain knowledge in the MLLM
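The SFT stage can be summarized as standard token-level cross-entropy on curated (UI input, target) pairs. A minimal sketch, assuming the usual negative log-likelihood objective (the function name and the toy probabilities are illustrative, not from the paper):

```python
import math

def sft_loss(target_token_probs):
    """Mean negative log-likelihood of the gold target tokens
    under the model's predicted distribution."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

# Hypothetical probabilities the model assigned to each gold token
# of a DSL target sequence; lower probability => higher loss.
loss = sft_loss([0.9, 0.8, 0.95])
```

Perfect predictions (probability 1.0 on every gold token) drive the loss to zero, which is why SFT alone tends to plateau once the easy supervised signal is exhausted.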
Group Relative Policy Optimization applied after SFT to further refine fine-grained UI understanding through reinforcement learning signals
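The distinguishing step in GRPO is computing advantages relative to a group of sampled responses rather than a learned value function. A hedged sketch of that group-relative normalization (not the paper's exact implementation; the 0/1 rewards stand in for a task score such as grounding correctness):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and std,
    so above-average samples get positive advantage and
    below-average samples get negative advantage."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one UI-grounding question,
# scored 1.0 (correct) or 0.0 (incorrect).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The normalized advantages then weight the policy-gradient update, rewarding responses that beat their own group's average.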
Direct Preference Optimization used for the generation task to ensure model outputs match human-preferred UI designs
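DPO trains directly on (chosen, rejected) preference pairs. A minimal sketch of the standard DPO objective (the canonical formulation, not necessarily the paper's exact setup): the loss is the negative log-sigmoid of the scaled margin between policy and reference log-probability ratios.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid of the beta-scaled margin
    between policy and reference log-ratios."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative log-probs: the policy favors the human-preferred UI
# more than the frozen reference model does, so the loss is below log(2).
loss = dpo_loss(-10.0, -12.0, ref_logp_chosen=-11.0, ref_logp_rejected=-11.0)
```

When the policy matches the reference exactly, the margin is zero and the loss equals log 2; the loss shrinks as the policy increasingly prefers the human-chosen UI.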
A domain-specific language designed to represent UI layouts and components in a format that is easily parsed and generated by language models
A process that converts the model's DSL output into rendered visual UI images for evaluation and deployment
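A rendering step of this kind can be a deterministic DSL-to-HTML transform. The JSON-like DSL below is hypothetical (the paper's actual DSL is not reproduced here); the sketch only illustrates how a structured model output can be rendered to a visual UI for evaluation or deployment:

```python
import json
from html import escape

def render(node):
    """Recursively turn a hypothetical DSL node into an HTML string."""
    tag = {"page": "div", "button": "button", "text": "span"}.get(node["type"], "div")
    children = "".join(render(c) for c in node.get("children", []))
    label = escape(node.get("label", ""))  # escape user-visible text
    return f'<{tag} class="{node["type"]}">{label}{children}</{tag}>'

# A toy DSL document, as the model might emit it.
dsl = json.loads('{"type": "page", "children": ['
                 '{"type": "text", "label": "Sign in"},'
                 '{"type": "button", "label": "Continue"}]}')
html = render(dsl)
```

Because the transform is deterministic, any rendering artifact can be attributed to the model's DSL output rather than the pipeline, which simplifies evaluation.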
Custom metrics tailored to assess both UI understanding accuracy and generation quality beyond generic MLLM benchmarks
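One common family of UI-understanding metrics is grounding accuracy: a predicted element box counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds a threshold. The paper's exact metrics may differ; this is a generic sketch:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, gts, thresh=0.5):
    """Fraction of predicted boxes whose IoU with the matching
    ground-truth box meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Toy example: one near-match and one complete miss.
acc = grounding_accuracy([(0, 0, 10, 10), (50, 50, 60, 60)],
                         [(1, 1, 10, 10), (0, 0, 5, 5)])
```

Generation quality, by contrast, typically needs render-then-compare or preference-based scoring, since there is no single correct output.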
Results
| Metric/Benchmark | Baseline (larger MLLMs / UI-specialized) | UI-UG | Delta |
|---|---|---|---|
| UI Understanding (fine-grained) | Larger general-purpose MLLMs | SOTA | Outperforms despite smaller size |
| UI Understanding (vs. specialists) | Similarly sized prior-SOTA UI specialists | SOTA | Outperforms similarly sized specialists |
| UI Generation Quality | Larger MLLMs | On par | Matched at a fraction of the compute cost |
| Unified Task Synergy | Separate understanding/generation models | Joint model improves both | Positive transfer between tasks |
Key Takeaways
- Joint training of understanding and generation in a single MLLM can improve both tasks, suggesting that practitioners should consider unified architectures rather than siloed models for related UI tasks
- GRPO applied after SFT is an effective strategy for fine-grained visual understanding in domain-specific settings, and ML engineers should consider RL-based fine-tuning when SFT alone plateaus on complex perception tasks
- Designing an LLM-friendly DSL is a critical practical step when adapting MLLMs to structured generation domains like UI; the choice of output representation significantly impacts both trainability and rendering fidelity
Abstract
Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they still face challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), which integrates both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on modern, complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly sized UI-specialized models. It is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We further demonstrate that integrating understanding and generation tasks improves accuracy and quality for both. Code and Model: https://github.com/neovateai/UI-UG