
Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

Shawn Wu, Sean Wang, Louie Li, Darcy Chen, Allen Wang, Jiangang Luo, Xudong Zhao, Joseph Shen, Gawain Ma, J. Jia, Marcus Mao, Claire Wang, Hunter He, Carolyn Wang, Z. Zhang, Jason Wang, C. Shen, Leo Zhang, Logan Chen, Qasim Meng, J. Gong, Dan Zhao, Penn Zheng, O. Zhu, Tong Yu
arXiv.org | 2026
Yuan3.0 Flash is an open-source MoE multimodal LLM with 3.7B activated / 40B total parameters that achieves frontier-level reasoning and strong enterprise task performance while reducing token usage by 50-75% through a novel RL algorithm called RAPO that mitigates overthinking in reasoning models.

Problem Statement

Large Reasoning Models (LRMs) suffer from 'overthinking' — generating excessively long chain-of-thought outputs that waste tokens and increase latency without proportional accuracy gains. Existing enterprise-focused LLMs often lack strong multimodal capabilities, RAG performance, and complex table understanding. There is also a gap between open-source models and proprietary frontier models on combined reasoning and enterprise tasks.

Key Novelty

  • Reflection-aware Adaptive Policy Optimization (RAPO): a novel RL training algorithm that explicitly detects and penalizes overthinking behaviors in reasoning models, enabling adaptive token efficiency
  • Enterprise-optimized MoE multimodal architecture: a 40B total / 3.7B activated parameter model specifically tuned for RAG, complex table understanding, and summarization alongside general reasoning
  • Efficient reasoning parity: achieves accuracy comparable to frontier models on math and science benchmarks while using only ~1/4 to 1/2 the average token count, demonstrating practical deployment efficiency

Evaluation Highlights

  • Achieves reasoning accuracy comparable to frontier models on math and science benchmarks while requiring only approximately 1/4 to 1/2 of the average tokens generated
  • Consistently achieves superior performance over comparable open-source models on enterprise tasks including RAG, complex table understanding, and summarization

Breakthrough Assessment

6/10. RAPO is a meaningful and practical contribution to the overthinking problem in LRMs, and the enterprise-focused multimodal MoE design fills a real gap; however, the architecture and training paradigm are evolutionary rather than paradigm-shifting, and the 40B-total-parameter scale sits within the current mainstream range.

Methodology

  1. Design a sparse MoE multimodal architecture with 40B total parameters but only 3.7B activated per forward pass, balancing capability with inference efficiency for enterprise deployment
  2. Train with Reflection-aware Adaptive Policy Optimization (RAPO), an RL algorithm that monitors chain-of-thought reflection patterns and applies adaptive penalties to discourage unnecessary token generation while preserving reasoning accuracy
  3. Fine-tune and evaluate on both enterprise-specific benchmarks (RAG, table understanding, summarization) and general reasoning benchmarks (math, science) to validate dual-purpose capability and token efficiency gains
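The summary does not give RAPO's exact formulation, but step 2 can be sketched as reward shaping: detect reflection patterns in the chain of thought and subtract an adaptive penalty, applied only to correct rollouts so that shortening never trades away accuracy. The marker list, penalty weights, and target length below are all hypothetical illustrations, not the paper's actual criteria.

```python
import re

# Hypothetical reflection markers; the actual RAPO detection criteria
# are not specified in this summary.
REFLECTION_MARKERS = re.compile(
    r"\b(wait|hmm|let me reconsider|on second thought)\b", re.IGNORECASE
)

def shaped_reward(chain_of_thought: str, is_correct: bool,
                  target_len: int = 2000,
                  alpha: float = 0.1, beta: float = 0.05) -> float:
    """Task reward minus adaptive penalties for overthinking.

    Penalties apply only when the answer is correct, so discouraging
    long reflection loops cannot push the policy toward wrong answers.
    """
    task_reward = 1.0 if is_correct else 0.0
    if not is_correct:
        return task_reward  # never penalize exploration on failed rollouts
    n_reflections = len(REFLECTION_MARKERS.findall(chain_of_thought))
    excess = max(0, len(chain_of_thought.split()) - target_len)
    return task_reward - alpha * n_reflections - beta * (excess / target_len)
```

In an RL loop, this scalar would replace the plain correctness reward when computing policy-gradient updates; the adaptive part is that the penalty scales with how much reflection and length each rollout actually exhibits.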

System Components

MoE Backbone (40B/3.7B)

Sparse Mixture-of-Experts language model with 40B total parameters but only 3.7B activated per token, enabling high-capacity modeling at low inference cost
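The 3.7B-of-40B activation ratio comes from sparse routing: a router scores all experts per token but only a few run. The summary does not state Yuan3.0 Flash's expert count or top-k value, so the sketch below is a generic top-k MoE layer with illustrative dimensions.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token to its top-k experts; only those experts execute.

    x:       (d,) token hidden state
    gate_w:  (n_experts, d) router weights
    experts: list of callables, each standing in for an expert FFN
    """
    logits = gate_w @ x
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    probs = np.exp(logits[topk] - logits[topk].max())
    probs /= probs.sum()                      # softmax over selected experts only
    # Weighted sum of only the activated experts' outputs; the other
    # n_experts - k expert FFNs contribute zero FLOPs for this token.
    return sum(p * experts[i](x) for p, i in zip(probs, topk))
```

With, say, 64 experts and k=2, only a small fraction of expert parameters fire per token, which is how a 40B-parameter model can cost roughly as much per token as a 3.7B dense one.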

Multimodal Encoder

Vision and modality encoding components that allow the model to process images and other non-text inputs alongside text for enterprise multimodal tasks

RAPO (Reflection-aware Adaptive Policy Optimization)

Novel RL training algorithm that identifies overthinking behaviors (excessive reflection loops) in reasoning chains and applies adaptive policy updates to reduce unnecessary token generation without sacrificing accuracy

Enterprise Task Modules

Specialized training data and evaluation pipelines targeting RAG, complex table understanding, and document summarization to ensure strong performance on business-critical applications

Results

| Benchmark/Task | Baseline | Yuan3.0 Flash | Delta |
|---|---|---|---|
| Math/science reasoning accuracy | Frontier model level (100%) | Comparable to frontier | ~0% accuracy loss |
| Average token usage (reasoning) | Frontier model baseline | 1/4 to 1/2 of baseline | 50-75% token reduction |
| RAG | Comparable open-source models | Superior | Positive improvement |
| Complex table understanding | Comparable open-source models | Superior | Positive improvement |
| Summarization | Comparable open-source models | Superior | Positive improvement |
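The 50-75% figure in the Delta column is simply the complement of the "1/4 to 1/2 of baseline tokens" ratio. A quick back-of-envelope check, using a hypothetical baseline chain-of-thought length and output-token price (neither figure is from the paper):

```python
def token_savings(fraction_of_baseline: float,
                  baseline_tokens: int = 8000,
                  price_per_1k: float = 0.01):
    """Return (token reduction, cost per query) for a given token fraction.

    baseline_tokens and price_per_1k are hypothetical placeholders,
    not figures reported in the paper.
    """
    reduction = 1.0 - fraction_of_baseline
    cost = baseline_tokens * fraction_of_baseline / 1000 * price_per_1k
    return reduction, cost

# Generating 1/4 of the baseline tokens -> 75% reduction; 1/2 -> 50%.
for frac in (0.25, 0.5):
    red, cost = token_savings(frac)
    print(f"{frac:.2f}x tokens: {red:.0%} fewer tokens, ${cost:.3f}/query")
```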

Key Takeaways

  • RAPO offers a practical RL-based solution to the overthinking problem that practitioners can study and potentially adapt to reduce inference costs in their own reasoning models without significant accuracy degradation
  • The 40B total / 3.7B activated MoE design is a strong reference point for teams building enterprise LLMs on a compute budget — sparse activation enables frontier-class capacity at a fraction of the per-token FLOPs
  • Full open-source release (weights + code) makes Yuan3.0 Flash directly deployable for enterprise RAG pipelines, table-heavy document processing, and summarization use cases where token efficiency and cost matter

Abstract

We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) multimodal large language model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. It also demonstrates strong reasoning capabilities in domains such as mathematics and science, attaining accuracy comparable to frontier models while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.

Generated on 2026-03-02 using Claude