A.X K1 Technical Report
Problem Statement
Large language models often require separate models for different inference modes (e.g., chain-of-thought reasoning vs. fast inference), creating deployment complexity and resource overhead. Existing open-source models also lack strong multilingual support, particularly for Korean, limiting their utility in non-English enterprise contexts. Balancing reasoning depth against inference efficiency at scale thus remains an open challenge for real-world deployment.
Key Novelty
- Think-Fusion training recipe: a unified training approach enabling explicit user-controlled switching between 'thinking' (extended reasoning) and 'non-thinking' (fast inference) modes within a single model
- Scaling law-guided optimization of both training configurations and vocabulary size under fixed computational budgets for a 519B MoE architecture
- Multi-stage data processing pipeline curating ~10T tokens with specialized curation for Korean-language data, achieving state-of-the-art performance on Korean benchmarks
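The report summarized here does not spell out the pipeline's individual stages, but deduplication is one stage it names explicitly. As an illustration only, a minimal exact-deduplication pass over normalized text might look like the sketch below; the function names are hypothetical and not taken from the A.X K1 pipeline, which presumably also uses fuzzy (near-duplicate) methods at 10T-token scale.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical copies hash equally."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedup_exact(docs):
    """Keep only the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "안녕하세요, 세계!",   # Korean: "Hello, world!"
    "Hello   world",
    "hello world",          # duplicate of the previous line after normalization
]
print(dedup_exact(corpus))  # the third document is dropped
```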
Evaluation Highlights
- A.X K1 achieves performance competitive with leading open-source models (e.g., DeepSeek-, Qwen-, and LLaMA-scale MoE models) on general reasoning and language benchmarks
- Establishes a distinctive advantage over competing models on Korean-language benchmarks, demonstrating superior multilingual specialization for Korean
Methodology
- Apply scaling laws to determine optimal model architecture (519B MoE), vocabulary size, and training hyperparameters under a fixed compute budget before training begins
- Pre-train on ~10T tokens curated via a multi-stage data pipeline emphasizing quality filtering, deduplication, and Korean-language data enrichment
- Apply Think-Fusion post-training recipe to fine-tune the model to support controllable reasoning, allowing users to explicitly toggle between extended thinking mode and direct response mode at inference time
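The summary does not specify how the user's mode choice is actually encoded at inference time (a special token, a chat-template flag, or a system instruction are all plausible). The sketch below is a hypothetical illustration of the controllable-reasoning interface; the `/think` and `/no_think` tags and the template format are placeholders, not the actual A.X K1 mechanism.

```python
def build_prompt(user_msg: str, thinking: bool) -> str:
    """Assemble a chat prompt with a hypothetical mode-control tag.

    The real A.X K1 control mechanism is not described in this summary;
    '/think' and '/no_think' are illustrative placeholders for whatever
    signal the Think-Fusion recipe trains the model to follow.
    """
    mode = "/think" if thinking else "/no_think"
    return f"<|system|>{mode}<|user|>{user_msg}<|assistant|>"

# Thinking mode: the model would emit an extended reasoning trace first.
print(build_prompt("Prove that sqrt(2) is irrational.", thinking=True))
# Non-thinking mode: the model answers directly for low-latency serving.
print(build_prompt("What is the capital of Korea?", thinking=False))
```

The point of the unified recipe is that both calls hit the same weights; only the prompt-level signal changes.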
System Components
- Mixture-of-Experts model with 519B total parameters, activating a sparse subset per token to balance capacity with computational efficiency during inference
- A unified post-training methodology that teaches the model to switch between chain-of-thought reasoning (thinking mode) and direct answer generation (non-thinking mode) based on user instruction, eliminating the need for separate models
- A curated data processing system for constructing the ~10T token pre-training corpus, with dedicated stages for quality filtering, deduplication, domain balancing, and Korean-language data enrichment
- A framework that uses empirical scaling laws to determine the optimal training configuration (model size, learning rate, batch size, vocabulary size) given a fixed FLOPs budget
- Specialized data curation and training focus on Korean-language content, enabling benchmark-leading performance on Korean NLP tasks
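The "sparse subset per token" behavior of the MoE component is implemented by a learned router. A.X K1's routing specifics (expert count, top-k, load-balancing losses) are not given in this summary, so the following is a generic top-k gating sketch, not the model's actual router:

```python
import numpy as np

def topk_gate(logits: np.ndarray, k: int = 2):
    """Softmax over the top-k expert logits; all other experts get weight 0.

    Generic sparse-MoE routing sketch. The expert count (4 here) and
    k=2 are illustrative, not A.X K1's configuration.
    """
    topk = np.argsort(logits)[-k:]                   # indices of the k largest logits
    weights = np.zeros_like(logits)
    exp = np.exp(logits[topk] - logits[topk].max())  # numerically stable softmax
    weights[topk] = exp / exp.sum()
    return topk, weights

router_scores = np.array([0.1, 2.0, -1.0, 1.5])  # one token, 4 experts
experts, w = topk_gate(router_scores, k=2)
# Only the two highest-scoring experts receive nonzero weight, so only
# their feed-forward blocks execute for this token — the source of the
# capacity/efficiency trade-off described above.
```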
Results
| Benchmark | Leading Open-Source Baseline | A.X K1 | Delta |
|---|---|---|---|
| Korean Language Benchmarks | Competitive open-source models (e.g., Qwen, LLaMA) | State-of-the-art among open-source | Distinctive advantage |
| General Reasoning Benchmarks | Leading open-source MoE models | Competitive / on-par | Neutral to slight improvement |
| Thinking Mode Tasks | Single-mode reasoning models | Unified model matches dedicated reasoning models | Efficiency gain (1 model vs. 2) |
| Non-Thinking Mode Tasks | Single-mode fast inference models | Competitive with fast-inference specialists | No quality degradation from unification |
Key Takeaways
- Think-Fusion offers a practical deployment pattern: a single MoE model can replace two separate models (reasoning and non-reasoning), reducing infrastructure complexity and memory footprint for production LLM serving
- Scaling law-guided vocabulary and architecture optimization before training is a cost-effective practice — ML teams building large models from scratch should invest in scaling law experiments to avoid suboptimal compute allocation
- For organizations targeting non-English markets (especially Korean), A.X K1 demonstrates that language-specific data curation within a large MoE framework can yield measurable benchmark advantages without sacrificing general capability
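A.X K1's fitted scaling laws are not reproduced in this summary, so as a hedged illustration of what "scaling law-guided compute allocation" means in practice, the sketch below uses the widely cited Chinchilla-style approximations (training compute C ≈ 6·N·D and roughly 20 tokens per parameter). These constants are assumptions, and for an MoE model N would refer to active rather than total parameters:

```python
import math

def compute_optimal(C: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget C between parameters N and training tokens D.

    Assumes C ≈ 6*N*D and a Chinchilla-style tokens-per-parameter ratio r,
    so C = 6*N*(r*N)  =>  N = sqrt(C / (6*r)).  A.X K1's actual fitted law
    (including vocabulary-size and MoE adjustments) is not given here.
    """
    N = math.sqrt(C / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Example: a 1e24-FLOP budget.
N, D = compute_optimal(1e24)
print(f"params ~ {N:.2e}, tokens ~ {D:.2e}")
```

The value of running such fits before training is that mis-allocating a fixed budget between model size and data size is unrecoverable after the fact.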
Abstract
We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.