Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications
Problem Statement
Large Reasoning Models (LRMs) suffer from "overthinking": generating excessively long chain-of-thought outputs that waste tokens and increase latency without proportional accuracy gains. Existing enterprise-focused LLMs often lack strong multimodal capabilities and underperform on RAG and complex table understanding. There is also a gap between open-source models and proprietary frontier models on combined reasoning and enterprise tasks.
Key Novelty
- Reflection-aware Adaptive Policy Optimization (RAPO): a novel RL training algorithm that explicitly detects and penalizes overthinking behaviors in reasoning models, enabling adaptive token efficiency
- Enterprise-optimized MoE multimodal architecture: a 40B total / 3.7B activated parameter model specifically tuned for RAG, complex table understanding, and summarization alongside general reasoning
- Efficient reasoning parity: achieves accuracy comparable to frontier models on math and science benchmarks while using only ~1/4 to 1/2 the average token count, demonstrating practical deployment efficiency
Evaluation Highlights
- Attains reasoning accuracy comparable to frontier models on math and science benchmarks while generating only about 1/4 to 1/2 as many tokens on average
- Consistently outperforms comparable open-source models on enterprise tasks including RAG, complex table understanding, and summarization
Methodology
- Design a sparse MoE multimodal architecture with 40B total parameters but only 3.7B activated per forward pass, balancing capability with inference efficiency for enterprise deployment
- Train with Reflection-aware Adaptive Policy Optimization (RAPO), an RL algorithm that monitors chain-of-thought reflection patterns and applies adaptive penalties to discourage unnecessary token generation while preserving reasoning accuracy
- Fine-tune and evaluate on both enterprise-specific benchmarks (RAG, table understanding, summarization) and general reasoning benchmarks (math, science) to validate dual-purpose capability and token efficiency gains
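The summary does not specify RAPO's exact reward design. As a minimal sketch of what "reflection-aware" reward shaping could look like, the hypothetical function below penalizes redundant reflection markers in a chain of thought only when the answer is already correct; the marker strings, the penalty schedule, and the coefficient are all illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical reflection-aware reward shaping, loosely inspired by the RAPO
# description above. Markers and coefficients are illustrative assumptions.
REFLECTION_MARKERS = ("wait", "let me reconsider", "on second thought")

def shaped_reward(answer_correct: bool, cot: str, base_penalty: float = 0.05) -> float:
    """Return a task reward minus an adaptive penalty for excess reflection."""
    reward = 1.0 if answer_correct else 0.0
    n_reflections = sum(cot.lower().count(m) for m in REFLECTION_MARKERS)
    # Adaptive: only penalize reflection when the answer is already correct,
    # so necessary self-correction on hard problems is not discouraged.
    penalty = base_penalty * n_reflections if answer_correct else 0.0
    return reward - penalty

print(shaped_reward(True, "Compute... wait, recheck... wait, final answer"))  # 0.9
print(shaped_reward(False, "wait, I need to redo this"))                      # 0.0
```

The asymmetry (no penalty on incorrect answers) is one simple way to keep the length pressure from degrading accuracy; a production RL setup would fold this shaped reward into a policy-gradient objective such as PPO or GRPO.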
System Components
- Sparse Mixture-of-Experts language model with 40B total parameters but only 3.7B activated per token, enabling high-capacity modeling at low inference cost
- Vision and modality encoding components that allow the model to process images and other non-text inputs alongside text for enterprise multimodal tasks
- RAPO, a novel RL training algorithm that identifies overthinking behaviors (excessive reflection loops) in reasoning chains and applies adaptive policy updates to reduce unnecessary token generation without sacrificing accuracy
- Specialized training data and evaluation pipelines targeting RAG, complex table understanding, and document summarization to ensure strong performance on business-critical applications
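To make the sparse-activation idea concrete, here is a minimal top-k routing sketch. The expert count, k value, and router scores are illustrative assumptions; the summary does not state Yuan3.0 Flash's actual routing scheme, only that roughly 3.7B of 40B parameters are active per token.

```python
import numpy as np

def topk_gate(logits: np.ndarray, k: int = 2):
    """Select the top-k experts for one token and softmax-normalize their weights."""
    idx = np.argsort(logits)[-k:]                 # indices of the k highest-scoring experts
    w = np.exp(logits[idx] - logits[idx].max())   # numerically stable softmax
    return idx, w / w.sum()

rng = np.random.default_rng(0)
router_logits = rng.normal(size=32)               # hypothetical: one score per expert
experts, weights = topk_gate(router_logits, k=2)
print(len(experts), round(float(weights.sum()), 6))  # 2 experts chosen, weights sum to 1
```

Because only the selected experts' feed-forward weights are read and multiplied for a given token, per-token FLOPs scale with the activated parameter count (here 2 of 32 experts), not the total.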
Results
| Benchmark/Task | Frontier Model Baseline | Yuan3.0 Flash | Delta |
|---|---|---|---|
| Math/Science Reasoning Accuracy | Frontier baseline (normalized to 100%) | Comparable to frontier | ~0% accuracy loss |
| Average Token Usage (Reasoning) | Frontier model baseline | 1/4 to 1/2 of baseline | 50-75% token reduction |
| RAG Performance | Comparable open-source models | Superior | Positive improvement |
| Complex Table Understanding | Comparable open-source models | Superior | Positive improvement |
| Summarization | Comparable open-source models | Superior | Positive improvement |
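The "50-75% token reduction" in the table is simply the complement of the stated 1/4 to 1/2 token ratios:

```python
# Token ratios stated in the paper: 1/4 to 1/2 of the frontier baseline.
for ratio in (0.25, 0.5):
    reduction = (1.0 - ratio) * 100
    print(f"uses {ratio:.2f}x baseline tokens -> {reduction:.0f}% reduction")
```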
Key Takeaways
- RAPO offers a practical RL-based solution to the overthinking problem that practitioners can study and potentially adapt to reduce inference costs in their own reasoning models without significant accuracy degradation
- The 40B total / 3.7B activated MoE design is a strong reference point for teams building enterprise LLMs on a compute budget — sparse activation enables frontier-class capacity at a fraction of the per-token FLOPs
- Full open-source release (weights + code) makes Yuan3.0 Flash directly deployable for enterprise RAG pipelines, table-heavy document processing, and summarization use cases where token efficiency and cost matter
Abstract
We introduce Yuan3.0 Flash, an open-source Mixture-of-Experts (MoE) multimodal large language model with 3.7B activated parameters and 40B total parameters, specifically designed to enhance performance on enterprise-oriented tasks while maintaining competitive capabilities on general-purpose tasks. To address the overthinking phenomenon commonly observed in Large Reasoning Models (LRMs), we propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm that effectively regulates overthinking behaviors. On enterprise-oriented tasks such as retrieval-augmented generation (RAG), complex table understanding, and summarization, Yuan3.0 Flash consistently achieves superior performance. It also demonstrates strong reasoning capabilities in domains such as mathematics and science, attaining accuracy comparable to frontier models while requiring only approximately 1/4 to 1/2 of the average tokens. Yuan3.0 Flash has been fully open-sourced to facilitate further research and real-world deployment: https://github.com/Yuan-lab-LLM/Yuan3.0.