
Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

Shawn Wu, Sean Wang, Louie Li, Darcy Chen, Allen Wang, Jiangang Luo, Xudong Zhao, Joseph Shen, Gawain Ma, J. Jia, Marcus Mao, Claire Wang, Hunter He, Carolyn Wang, Z. Zhang, Jason Wang, C. Shen, Leo Zhang, Logan Chen, Qasim Meng, J. Gong, Dan Zhao, Penn Zheng, O. Zhu, Tong Yu
arXiv.org | 2026
Yuan3.0 Flash is an open-source MoE multimodal LLM with 3.7B activated / 40B total parameters that achieves frontier-level reasoning and strong enterprise task performance while reducing token usage by 50-75% through a novel RL algorithm called RAPO that mitigates overthinking in reasoning models.

Problem Statement

Large Reasoning Models (LRMs) suffer from 'overthinking' — generating excessively long chain-of-thought outputs that waste tokens and increase latency without proportional accuracy gains. Existing enterprise-focused LLMs often lack strong multimodal capabilities, RAG performance, and complex table understanding. There is also a gap between open-source models and proprietary frontier models on combined reasoning and enterprise tasks.

Key Novelty

  • Reflection-aware Adaptive Policy Optimization (RAPO): a novel RL training algorithm that explicitly detects and penalizes overthinking behaviors in reasoning models, enabling adaptive token efficiency
  • Enterprise-optimized MoE multimodal architecture: a 40B total / 3.7B activated parameter model specifically tuned for RAG, complex table understanding, and summarization alongside general reasoning
  • Efficient reasoning parity: achieves accuracy comparable to frontier models on math and science benchmarks while using only ~1/4 to 1/2 the average token count, demonstrating practical deployment efficiency

Evaluation Highlights

  • Achieves reasoning accuracy comparable to frontier models on math and science benchmarks while requiring only approximately 1/4 to 1/2 of the average tokens generated
  • Consistently achieves superior performance over comparable open-source models on enterprise tasks including RAG, complex table understanding, and summarization

Breakthrough Assessment

6/10. RAPO is a meaningful and practical contribution to the overthinking problem in LRMs, and the enterprise-focused multimodal MoE design fills a real gap; however, the architecture and training paradigm are evolutionary rather than paradigm-shifting, and the 40B-total-parameter scale sits within the current mainstream range.

Methodology

  1. Design a sparse MoE multimodal architecture with 40B total parameters but only 3.7B activated per forward pass, balancing capability with inference efficiency for enterprise deployment
  2. Train with Reflection-aware Adaptive Policy Optimization (RAPO), an RL algorithm that monitors chain-of-thought reflection patterns and applies adaptive penalties to discourage unnecessary token generation while preserving reasoning accuracy
  3. Fine-tune and evaluate on both enterprise-specific benchmarks (RAG, table understanding, summarization) and general reasoning benchmarks (math, science) to validate dual-purpose capability and token efficiency gains
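The summary does not give RAPO's exact formulation, but step 2 can be sketched as reward shaping: detect reflection patterns in the chain of thought and subtract an adaptive penalty, applied only to correct rollouts so that shortening never trades away accuracy. The marker list, penalty weights, and target length below are all hypothetical illustrations, not the paper's actual criteria.

```python
import re

# Hypothetical reflection markers; the actual RAPO detection criteria
# are not specified in this summary.
REFLECTION_MARKERS = re.compile(
    r"\b(wait|hmm|let me reconsider|on second thought)\b", re.IGNORECASE
)

def shaped_reward(chain_of_thought: str, is_correct: bool,
                  target_len: int = 2000,
                  alpha: float = 0.1, beta: float = 0.05) -> float:
    """Task reward minus adaptive penalties for overthinking.

    Penalties apply only when the answer is correct, so discouraging
    long reflection loops cannot push the policy toward wrong answers.
    """
    task_reward = 1.0 if is_correct else 0.0
    if not is_correct:
        return task_reward  # never penalize exploration on failed rollouts
    n_reflections = len(REFLECTION_MARKERS.findall(chain_of_thought))
    excess = max(0, len(chain_of_thought.split()) - target_len)
    return task_reward - alpha * n_reflections - beta * (excess / target_len)
```

In an RL loop, this scalar would replace the plain correctness reward when computing policy-gradient updates; the adaptive part is that the penalty scales with how much reflection and length each rollout actually exhibits.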

System Components

MoE Backbone (40B/3.7B)

Sparse Mixture-of-Experts language model with 40B total parameters but only 3.7B activated per token, enabling high-capacity modeling at low inference cost
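The 3.7B-of-40B activation ratio comes from sparse routing: a router scores all experts per token but only a few run. The summary does not state Yuan3.0 Flash's expert count or top-k value, so the sketch below is a generic top-k MoE layer with illustrative dimensions.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token to its top-k experts; only those experts execute.

    x:       (d,) token hidden state
    gate_w:  (n_experts, d) router weights
    experts: list of callables, each standing in for an expert FFN
    """
    logits = gate_w @ x
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    probs = np.exp(logits[topk] - logits[topk].max())
    probs /= probs.sum()                      # softmax over selected experts only
    # Weighted sum of only the activated experts' outputs; the other
    # n_experts - k expert FFNs contribute zero FLOPs for this token.
    return sum(p * experts[i](x) for p, i in zip(probs, topk))
```

With, say, 64 experts and k=2, only a small fraction of expert parameters fire per token, which is how a 40B-parameter model can cost roughly as much per token as a 3.7B dense one.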

Multimodal Encoder

Vision and modality encoding components that allow the model to process images and other non-text inputs alongside text for enterprise multimodal tasks

RAPO (Reflection-aware Adaptive Policy Optimization)

Novel RL training algorithm that identifies overthinking behaviors (excessive reflection loops) in reasoning chains and applies adaptive policy updates to reduce unnecessary token generation without sacrificing accuracy

Enterprise Task Modules

Specialized training data and evaluation pipelines targeting RAG, complex table understanding, and document summarization to ensure strong performance on business-critical applications

Results

| Benchmark/Task | Baseline | Yuan3.0 Flash | Delta |
|---|---|---|---|
| Math/science reasoning accuracy | Frontier model level (100%) | Comparable to frontier | ~0% accuracy loss |
| Average token usage (reasoning) | Frontier model baseline | 1/4 to 1/2 of baseline | 50-75% token reduction |
| RAG | Comparable open-source models | Superior | Positive improvement |
| Complex table understanding | Comparable open-source models | Superior | Positive improvement |
| Summarization | Comparable open-source models | Superior | Positive improvement |
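The 50-75% figure in the Delta column is simply the complement of the "1/4 to 1/2 of baseline tokens" ratio. A quick back-of-envelope check, using a hypothetical baseline chain-of-thought length and output-token price (neither figure is from the paper):

```python
def token_savings(fraction_of_baseline: float,
                  baseline_tokens: int = 8000,
                  price_per_1k: float = 0.01):
    """Return (token reduction, cost per query) for a given token fraction.

    baseline_tokens and price_per_1k are hypothetical placeholders,
    not figures reported in the paper.
    """
    reduction = 1.0 - fraction_of_baseline
    cost = baseline_tokens * fraction_of_baseline / 1000 * price_per_1k
    return reduction, cost

# Generating 1/4 of the baseline tokens -> 75% reduction; 1/2 -> 50%.
for frac in (0.25, 0.5):
    red, cost = token_savings(frac)
    print(f"{frac:.2f}x tokens: {red:.0%} fewer tokens, ${cost:.3f}/query")
```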

Key Takeaways

  • RAPO offers a practical RL-based solution to the overthinking problem that practitioners can study and potentially adapt to reduce inference costs in their own reasoning models without significant accuracy degradation
  • The 40B total / 3.7B activated MoE design is a strong reference point for teams building enterprise LLMs on a compute budget — sparse activation enables frontier-class capacity at a fraction of the per-token FLOPs
  • Full open-source release (weights + code) makes Yuan3.0 Flash directly deployable for enterprise RAG pipelines, table-heavy document processing, and summarization use cases where token efficiency and cost matter

Abstract

We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) multimodal large language model featuring 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. In enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. It also demonstrates strong reasoning capabilities in domains such as mathematics and science, attaining accuracy comparable to frontier models while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.

Generated on 2026-03-02 using Claude