Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models
Problem Statement
Existing legal AI benchmarks conflate factual retrieval with genuine multi-step inference, fragment the reasoning process into isolated subtasks, and neglect the quality of intermediate reasoning steps. This makes it difficult to assess whether LLMs can perform the kind of structured, expert-level legal reasoning required in real judicial decision-making. There is also a lack of fine-grained, step-level annotated legal reasoning datasets, especially for Chinese legal contexts.
Key Novelty
- MSLR: The first Chinese multi-step legal reasoning dataset structured around the IRAC (Issue, Rule, Application, Conclusion) framework, grounded in real judicial documents
- A scalable Human-LLM collaborative annotation pipeline that produces fine-grained step-level reasoning annotations efficiently and serves as a reusable methodology for future multi-step reasoning datasets
- Empirical finding that Self-Initiated Chain-of-Thought prompts—autonomously generated by models—outperform human-designed prompts in reasoning coherence and quality on legal tasks
Evaluation Highlights
- Multiple LLMs evaluated on MSLR achieved only moderate performance, indicating that complex multi-step legal reasoning remains a significant challenge for current frontier models
- Self-Initiated CoT prompts generated autonomously by models demonstrated superior reasoning coherence and quality compared to human-designed CoT prompts across evaluated models
Methodology
- Construct the MSLR dataset by sourcing real Chinese judicial decisions and structuring them into IRAC (Issue, Rule, Application, Conclusion) reasoning steps to model expert legal inference
- Develop a Human-LLM collaborative annotation pipeline where LLMs generate candidate step-level reasoning annotations that human experts review and refine, enabling scalable fine-grained labeling
- Benchmark multiple LLMs on MSLR using various prompting strategies—including human-designed and Self-Initiated Chain-of-Thought prompts—and analyze performance across IRAC steps to assess reasoning quality
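The benchmarking step above can be sketched as a toy evaluation loop: each model is run under each prompting strategy and scored per IRAC step. This is an illustrative assumption about the setup, not the paper's actual harness; `run_model` and `score_step` are stubs, and real scoring would compare generations against MSLR's gold step annotations with the paper's own metrics.

```python
# Toy sketch of the per-step benchmarking loop (assumed structure, not the
# paper's implementation). Both helper functions are stubs.

IRAC_STEPS = ("issue", "rule", "application", "conclusion")

def run_model(model: str, strategy: str, example: dict) -> dict:
    # Stub: in practice this would prompt the LLM and parse one text per step.
    return {step: f"{model}/{strategy} output for {step}" for step in IRAC_STEPS}

def score_step(generated: str, gold: str) -> float:
    # Stub metric: token overlap stands in for the paper's actual metrics.
    g, r = set(generated.split()), set(gold.split())
    return len(g & r) / max(len(r), 1)

def benchmark(models, strategies, dataset):
    """Average per-step score for every (model, strategy) pair."""
    results = {}
    for model in models:
        for strategy in strategies:
            totals = {step: 0.0 for step in IRAC_STEPS}
            for ex in dataset:
                out = run_model(model, strategy, ex)
                for step in IRAC_STEPS:
                    totals[step] += score_step(out[step], ex[step])
            results[(model, strategy)] = {s: v / len(dataset) for s, v in totals.items()}
    return results

dataset = [{s: f"gold {s} text" for s in IRAC_STEPS}]
scores = benchmark(["model-a"], ["human-cot", "self-initiated-cot"], dataset)
```

Breaking scores out per IRAC step, as the paper does, shows where models fail (e.g. rule retrieval versus application) rather than reporting a single aggregate number.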
System Components
MSLR dataset: A Chinese multi-step legal reasoning benchmark derived from real judicial documents, annotated with step-level reasoning following the IRAC framework to test genuine inference rather than factual recall
IRAC framework: Structures legal reasoning into four ordered steps (Issue identification, Rule retrieval, Application of the rule to the facts, and Conclusion), mirroring expert judicial reasoning
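The four-step structure can be made concrete with a minimal schema. This is a hypothetical sketch; the field names are illustrative and do not reflect MSLR's actual data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one IRAC-annotated example (illustrative only).
@dataclass
class IRACExample:
    case_facts: str
    issue: str        # step 1: the legal question raised by the facts
    rule: str         # step 2: the statute or precedent that governs it
    application: str  # step 3: how the rule applies to these facts
    conclusion: str   # step 4: the resulting judgment

    def steps(self) -> list:
        """Return the four reasoning steps in their fixed IRAC order."""
        return [self.issue, self.rule, self.application, self.conclusion]

example = IRACExample(
    case_facts="Tenant withheld rent after the landlord failed to repair heating.",
    issue="May the tenant lawfully withhold rent?",
    rule="The landlord must keep the premises habitable.",
    application="Failing to repair the heating breached that obligation.",
    conclusion="The tenant's withholding was justified in part.",
)
```

Fixing the step order in the schema is what lets a benchmark score each stage of the reasoning chain separately.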
Human-LLM collaborative annotation pipeline: A scalable annotation methodology where LLMs draft fine-grained step-level reasoning labels that human annotators verify and refine, balancing quality with annotation efficiency
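A minimal sketch of that draft-then-verify loop, assuming a two-stage design: `draft_with_llm` and `human_review` are stand-ins; in practice the first would call an LLM API and the second would route drafts to expert annotators.

```python
# Sketch of the Human-LLM collaborative annotation loop (assumed two-stage
# design, not the paper's exact pipeline). Both stages are stubbed.

IRAC_STEPS = ("issue", "rule", "application", "conclusion")

def draft_with_llm(document: str) -> dict:
    """Stage 1: an LLM drafts candidate step-level labels (stubbed)."""
    return {step: f"draft {step} for: {document[:20]}..." for step in IRAC_STEPS}

def human_review(draft: dict) -> dict:
    """Stage 2: experts verify each step, correcting where needed (stubbed:
    every draft is accepted unchanged in this sketch)."""
    return dict(draft)

def annotate(documents: list) -> list:
    return [human_review(draft_with_llm(doc)) for doc in documents]

annotations = annotate(["Judgment text of case A...", "Judgment text of case B..."])
```

The efficiency gain comes from humans reviewing drafts instead of writing step-level labels from scratch, which is what makes fine-grained annotation scale.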
Self-Initiated Chain-of-Thought prompting: A prompting strategy where the model autonomously generates its own reasoning-chain prompts rather than following human-designed templates, shown to improve coherence on legal tasks
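One plausible reading of Self-Initiated CoT is a two-call structure: the model first writes its own reasoning scaffold for the task, then answers while following it. That structure is an assumption here, and `ask_llm` is a placeholder for a real chat-completion call.

```python
# Sketch contrasting a human-designed CoT template with a self-initiated one.
# `ask_llm` is a stub; the two-call structure is an assumed interpretation.

def ask_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"  # placeholder response

# A fixed, human-designed template applies the same scaffold to every case.
HUMAN_DESIGNED = ("Think step by step: identify the issue, cite the rule, "
                  "apply it to the facts, then conclude.")

def self_initiated_cot(question: str) -> str:
    # Call 1: the model drafts its own reasoning-chain prompt for this task.
    scaffold = ask_llm(f"Write a step-by-step reasoning plan for answering: {question}")
    # Call 2: the model answers while following its self-generated scaffold.
    return ask_llm(f"{scaffold}\n\nNow answer: {question}")

answer = self_initiated_cot("May the tenant lawfully withhold rent?")
```

The contrast with `HUMAN_DESIGNED` is the point: a self-generated scaffold can adapt to the specific case rather than imposing one fixed template on every question.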
Results
| Metric/Benchmark | Baseline (Human-Designed CoT) | This Paper (Self-Initiated CoT) | Delta |
|---|---|---|---|
| Overall MSLR performance (multiple LLMs) | Moderate (exact scores not reported in the abstract) | Moderate; Self-Initiated CoT improves coherence and quality | Positive; magnitude not reported |
| Reasoning coherence and quality | Lower with human-designed CoT prompts | Higher with Self-Initiated CoT prompts | Self-Initiated CoT outperforms human-designed prompts |
Key Takeaways
- For legal AI practitioners: MSLR provides a rigorous, real-world benchmark for evaluating LLMs on structured multi-step legal reasoning in Chinese, moving beyond superficial factual QA tasks
- For prompt engineers and ML researchers: Allowing models to self-generate their own Chain-of-Thought prompts can outperform carefully crafted human-designed prompts, suggesting that model-autonomy in reasoning scaffolding is a valuable design choice
- For dataset creators: The Human-LLM collaborative annotation pipeline offers a reusable and scalable template for building fine-grained step-level reasoning datasets in other specialized domains beyond law
Abstract
Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.