A.X K1 Technical Report

SK Telecom
arXiv.org | 2026
A.X K1 is a 519B-parameter Mixture-of-Experts language model trained from scratch by SK Telecom, featuring a novel Think-Fusion training recipe that enables user-controlled switching between reasoning and non-reasoning modes within a single unified model. The model is optimized for both reasoning capability and inference efficiency, with particular strength in Korean-language tasks.

Problem Statement

Large language models often require separate models for different inference modes (e.g., chain-of-thought reasoning vs. fast inference), creating deployment complexity and resource overhead. Existing open-source models lack strong multilingual support, particularly for Korean, limiting their utility in non-English enterprise contexts. Balancing reasoning depth with inference efficiency at scale remains an unsolved challenge for practical real-world deployment.

Key Novelty

  • Think-Fusion training recipe: a unified training approach enabling explicit user-controlled switching between 'thinking' (extended reasoning) and 'non-thinking' (fast inference) modes within a single model
  • Scaling law-guided optimization of both training configurations and vocabulary size under fixed computational budgets for a 519B MoE architecture
  • Multi-stage data processing pipeline curating ~10T tokens with specialized curation for Korean-language data, achieving state-of-the-art performance on Korean benchmarks

Evaluation Highlights

  • A.X K1 achieves performance competitive with leading open-source models (e.g., the DeepSeek and Qwen model families) on general reasoning and language benchmarks
  • Establishes a distinctive advantage over competing models on Korean-language benchmarks, demonstrating superior multilingual specialization for Korean

Breakthrough Assessment

5/10. A.X K1 is a solid engineering and research contribution with a practical innovation in Think-Fusion for controllable reasoning, but it is primarily a scaled MoE model with incremental architectural novelty; the core techniques build on established MoE, scaling law, and RLHF/reasoning training paradigms rather than introducing fundamentally new methods.

Methodology

  1. Apply scaling laws to determine optimal model architecture (519B MoE), vocabulary size, and training hyperparameters under a fixed compute budget before training begins
  2. Pre-train on ~10T tokens curated via a multi-stage data pipeline emphasizing quality filtering, deduplication, and Korean-language data enrichment
  3. Apply Think-Fusion post-training recipe to fine-tune the model to support controllable reasoning, allowing users to explicitly toggle between extended thinking mode and direct response mode at inference time

System Components

519B MoE Architecture

Mixture-of-Experts model with 519B total parameters, activating a sparse subset per token to balance capacity with computational efficiency during inference
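The summary does not specify A.X K1's routing mechanism, but sparse activation in MoE models is conventionally done with top-k gating. A minimal sketch (hypothetical expert count and k, pure Python) shows how only a small subset of experts is selected and weighted per token:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_gate(router_logits, k=2):
    """Select the k highest-scoring experts and renormalize their
    softmax weights; all other experts stay inactive for this token."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}

# Example: 8 experts, only 2 activated for this token (illustrative logits).
gates = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2], k=2)
```

Because only k experts run per token, per-token FLOPs scale with the active parameters rather than the full 519B.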

Think-Fusion Training Recipe

A unified post-training methodology that teaches the model to switch between chain-of-thought reasoning (thinking mode) and direct answer generation (non-thinking mode) based on user instruction, eliminating the need for separate models
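The exact control interface is not described in this summary; models with dual inference modes typically expose the toggle through a chat-template flag or a control tag in the prompt. A sketch under that assumption, with hypothetical tag names (`/think`, `/no_think`) that may differ from what A.X K1 actually uses:

```python
def build_prompt(user_msg, thinking=True,
                 think_on="/think", think_off="/no_think"):
    """Prepend a hypothetical control tag that toggles the model's
    reasoning mode; the real tokens used by A.X K1 may differ."""
    tag = think_on if thinking else think_off
    return f"{tag}\nUser: {user_msg}\nAssistant:"

# Same model, two inference behaviors selected at request time.
fast = build_prompt("What is 2+2?", thinking=False)
deep = build_prompt("Prove that the sum of two odd numbers is even.",
                    thinking=True)
```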

Multi-Stage Data Pipeline

A curated data processing system for constructing the ~10T token pre-training corpus, with dedicated stages for quality filtering, deduplication, domain balancing, and Korean-language data enrichment
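The report's pipeline stages are only named here, not specified. A toy sketch of two of them, exact-hash deduplication and a crude length-based quality filter (real pipelines use classifiers, perplexity filters, and fuzzy dedup), illustrates the staged structure:

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicates via content hashing (stage: deduplication)."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

def quality_filter(docs, min_chars=20):
    """Keep documents above a crude length threshold (stage: quality
    filtering); a stand-in for the report's unspecified heuristics."""
    return [d for d in docs if len(d.strip()) >= min_chars]

corpus = ["short",
          "A long enough paragraph about language models.",
          "A long enough paragraph about language models.",
          "Another sufficiently long document for the corpus."]
cleaned = quality_filter(dedup_exact(corpus))
```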

Scaling Law Optimizer

A framework that uses empirical scaling laws to determine the optimal training configuration (model size, learning rate, batch size, vocabulary size) given a fixed FLOPs budget
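The report's fitted coefficients are not given in this summary; a sketch of the general idea uses a Chinchilla-style loss surrogate with the Hoffmann et al. coefficients (an assumption, not A.X K1's fit) and the common approximation that training FLOPs C ≈ 6·N·D:

```python
# Assumed coefficients from the Hoffmann et al. (Chinchilla) fit:
# L(N, D) = E + A / N**a + B / D**b
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**a + B / D**b

def best_config(C, candidate_params):
    """Grid-search parameter counts N; token count D is implied by the
    fixed budget C = 6 * N * D. Returns (loss, N, D) with minimal loss."""
    best = None
    for N in candidate_params:
        D = C / (6 * N)
        cand = (loss(N, D), N, D)
        if best is None or cand < best:
            best = cand
    return best

C = 1e24  # hypothetical FLOPs budget
l, N, D = best_config(C, [1e9 * k for k in (5, 10, 20, 40, 80, 160)])
```

The same search generalizes to other knobs the report mentions (vocabulary size, learning rate, batch size) once a fitted surrogate for each is available.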

Korean Language Module

Specialized data curation and training focus on Korean-language content, enabling benchmark-leading performance on Korean NLP tasks
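The curation details are not disclosed; one simple building block such a stage could use is a Hangul-ratio heuristic over the Unicode syllable block (U+AC00..U+D7A3) to flag Korean documents for enrichment. A sketch, with an assumed threshold:

```python
def hangul_ratio(text):
    """Fraction of non-space characters in the Hangul syllable block
    (U+AC00..U+D7A3); a crude language-identification signal."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if "\uac00" <= c <= "\ud7a3")
    return hangul / len(chars)

def is_korean(text, threshold=0.3):
    """Hypothetical filter; production pipelines would combine this
    with a trained language-ID model."""
    return hangul_ratio(text) >= threshold

ko = is_korean("안녕하세요, 반갑습니다.")   # "Hello, nice to meet you."
en = is_korean("Hello, nice to meet you.")
```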

Results

| Benchmark | Leading Open-Source Baseline | A.X K1 | Delta |
| --- | --- | --- | --- |
| Korean Language Benchmarks | Competitive open-source models (e.g., Qwen, LLaMA) | State-of-the-art among open-source | Distinctive advantage |
| General Reasoning Benchmarks | Leading open-source MoE models | Competitive / on-par | Neutral to slight improvement |
| Thinking Mode Tasks | Single-mode reasoning models | Unified model matches dedicated reasoning models | Efficiency gain (1 model vs. 2) |
| Non-Thinking Mode Tasks | Single-mode fast-inference models | Competitive with fast-inference specialists | No quality degradation from unification |

Key Takeaways

  • Think-Fusion offers a practical deployment pattern: a single MoE model can replace two separate models (reasoning and non-reasoning), reducing infrastructure complexity and memory footprint for production LLM serving
  • Scaling law-guided vocabulary and architecture optimization before training is a cost-effective practice — ML teams building large models from scratch should invest in scaling law experiments to avoid suboptimal compute allocation
  • For organizations targeting non-English markets (especially Korean), A.X K1 demonstrates that language-specific data curation within a large MoE framework can yield measurable benchmark advantages without sacrificing general capability

Abstract

We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.

Generated on 2026-03-02 using Claude