Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models
Problem Statement
Existing legal AI benchmarks conflate factual retrieval with genuine multi-step inference, fragment the reasoning process into isolated subtasks, and neglect the quality of intermediate reasoning steps. This makes it difficult to assess whether LLMs can perform the kind of structured, expert-level legal reasoning required in real judicial decision-making. There is also a lack of fine-grained, step-level annotated legal reasoning datasets, especially for Chinese legal contexts.
Key Novelty
- MSLR: The first Chinese multi-step legal reasoning dataset structured around the IRAC (Issue, Rule, Application, Conclusion) framework, grounded in real judicial documents
- A scalable Human-LLM collaborative annotation pipeline that produces fine-grained step-level reasoning annotations efficiently and serves as a reusable methodology for future multi-step reasoning datasets
- Empirical finding that Self-Initiated Chain-of-Thought prompts—autonomously generated by models—outperform human-designed prompts in reasoning coherence and quality on legal tasks
Evaluation Highlights
- Multiple LLMs evaluated on MSLR achieved only moderate performance, indicating that complex multi-step legal reasoning remains a significant challenge for current frontier models
- Self-Initiated CoT prompts generated autonomously by models demonstrated superior reasoning coherence and quality compared to human-designed CoT prompts across evaluated models
Methodology
- Construct the MSLR dataset by sourcing real Chinese judicial decisions and structuring them into IRAC (Issue, Rule, Application, Conclusion) reasoning steps to model expert legal inference
- Develop a Human-LLM collaborative annotation pipeline where LLMs generate candidate step-level reasoning annotations that human experts review and refine, enabling scalable fine-grained labeling
- Benchmark multiple LLMs on MSLR using various prompting strategies—including human-designed and Self-Initiated Chain-of-Thought prompts—and analyze performance across IRAC steps to assess reasoning quality
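The benchmarking step above can be sketched as a toy evaluation loop: each model is run under each prompting strategy and scored per IRAC step. This is an illustrative assumption about the setup, not the paper's actual harness; `run_model` and `score_step` are stubs, and real scoring would compare generations against MSLR's gold step annotations with the paper's own metrics.

```python
# Toy sketch of the per-step benchmarking loop (assumed structure, not the
# paper's implementation). Both helper functions are stubs.

IRAC_STEPS = ("issue", "rule", "application", "conclusion")

def run_model(model: str, strategy: str, example: dict) -> dict:
    # Stub: in practice this would prompt the LLM and parse one text per step.
    return {step: f"{model}/{strategy} output for {step}" for step in IRAC_STEPS}

def score_step(generated: str, gold: str) -> float:
    # Stub metric: token overlap stands in for the paper's actual metrics.
    g, r = set(generated.split()), set(gold.split())
    return len(g & r) / max(len(r), 1)

def benchmark(models, strategies, dataset):
    """Average per-step score for every (model, strategy) pair."""
    results = {}
    for model in models:
        for strategy in strategies:
            totals = {step: 0.0 for step in IRAC_STEPS}
            for ex in dataset:
                out = run_model(model, strategy, ex)
                for step in IRAC_STEPS:
                    totals[step] += score_step(out[step], ex[step])
            results[(model, strategy)] = {s: v / len(dataset) for s, v in totals.items()}
    return results

dataset = [{s: f"gold {s} text" for s in IRAC_STEPS}]
scores = benchmark(["model-a"], ["human-cot", "self-initiated-cot"], dataset)
```

Breaking scores out per IRAC step, as the paper does, shows where models fail (e.g. rule retrieval versus application) rather than reporting a single aggregate number.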
System Components
MSLR dataset: A Chinese multi-step legal reasoning benchmark derived from real judicial documents, annotated with step-level reasoning following the IRAC framework to test genuine inference rather than factual recall
IRAC framework: Structures legal reasoning into four ordered steps (Issue identification, Rule retrieval, Application of the rule to the facts, and Conclusion), mirroring expert judicial reasoning
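The four-step structure can be made concrete with a minimal schema. This is a hypothetical sketch; the field names are illustrative and do not reflect MSLR's actual data format.

```python
from dataclasses import dataclass

# Hypothetical schema for one IRAC-annotated example (illustrative only).
@dataclass
class IRACExample:
    case_facts: str
    issue: str        # step 1: the legal question raised by the facts
    rule: str         # step 2: the statute or precedent that governs it
    application: str  # step 3: how the rule applies to these facts
    conclusion: str   # step 4: the resulting judgment

    def steps(self) -> list:
        """Return the four reasoning steps in their fixed IRAC order."""
        return [self.issue, self.rule, self.application, self.conclusion]

example = IRACExample(
    case_facts="Tenant withheld rent after the landlord failed to repair heating.",
    issue="May the tenant lawfully withhold rent?",
    rule="The landlord must keep the premises habitable.",
    application="Failing to repair the heating breached that obligation.",
    conclusion="The tenant's withholding was justified in part.",
)
```

Fixing the step order in the schema is what lets a benchmark score each stage of the reasoning chain separately.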
Human-LLM collaborative annotation pipeline: A scalable annotation methodology where LLMs draft fine-grained step-level reasoning labels that human annotators verify and refine, balancing quality with annotation efficiency
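A minimal sketch of that draft-then-verify loop, assuming a two-stage design: `draft_with_llm` and `human_review` are stand-ins; in practice the first would call an LLM API and the second would route drafts to expert annotators.

```python
# Sketch of the Human-LLM collaborative annotation loop (assumed two-stage
# design, not the paper's exact pipeline). Both stages are stubbed.

IRAC_STEPS = ("issue", "rule", "application", "conclusion")

def draft_with_llm(document: str) -> dict:
    """Stage 1: an LLM drafts candidate step-level labels (stubbed)."""
    return {step: f"draft {step} for: {document[:20]}..." for step in IRAC_STEPS}

def human_review(draft: dict) -> dict:
    """Stage 2: experts verify each step, correcting where needed (stubbed:
    every draft is accepted unchanged in this sketch)."""
    return dict(draft)

def annotate(documents: list) -> list:
    return [human_review(draft_with_llm(doc)) for doc in documents]

annotations = annotate(["Judgment text of case A...", "Judgment text of case B..."])
```

The efficiency gain comes from humans reviewing drafts instead of writing step-level labels from scratch, which is what makes fine-grained annotation scale.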
Self-Initiated Chain-of-Thought prompting: A prompting strategy where the model autonomously generates its own reasoning-chain prompts rather than following human-designed templates, shown to improve coherence on legal tasks
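One plausible reading of Self-Initiated CoT is a two-call structure: the model first writes its own reasoning scaffold for the task, then answers while following it. That structure is an assumption here, and `ask_llm` is a placeholder for a real chat-completion call.

```python
# Sketch contrasting a human-designed CoT template with a self-initiated one.
# `ask_llm` is a stub; the two-call structure is an assumed interpretation.

def ask_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"  # placeholder response

# A fixed, human-designed template applies the same scaffold to every case.
HUMAN_DESIGNED = ("Think step by step: identify the issue, cite the rule, "
                  "apply it to the facts, then conclude.")

def self_initiated_cot(question: str) -> str:
    # Call 1: the model drafts its own reasoning-chain prompt for this task.
    scaffold = ask_llm(f"Write a step-by-step reasoning plan for answering: {question}")
    # Call 2: the model answers while following its self-generated scaffold.
    return ask_llm(f"{scaffold}\n\nNow answer: {question}")

answer = self_initiated_cot("May the tenant lawfully withhold rent?")
```

The contrast with `HUMAN_DESIGNED` is the point: a self-generated scaffold can adapt to the specific case rather than imposing one fixed template on every question.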
Results
| Metric/Benchmark | Baseline (Human-Designed CoT) | This Paper (Self-Initiated CoT) | Delta |
|---|---|---|---|
| Overall MSLR performance (multiple LLMs) | Moderate (exact scores not reported in the abstract) | Moderate; Self-Initiated CoT improves coherence and quality | Positive; magnitude not reported |
| Reasoning coherence and quality | Lower with human-designed CoT prompts | Higher with Self-Initiated CoT prompts | Self-Initiated CoT outperforms human-designed prompts |
Key Takeaways
- For legal AI practitioners: MSLR provides a rigorous, real-world benchmark for evaluating LLMs on structured multi-step legal reasoning in Chinese, moving beyond superficial factual QA tasks
- For prompt engineers and ML researchers: Allowing models to self-generate their own Chain-of-Thought prompts can outperform carefully crafted human-designed prompts, suggesting that model-autonomy in reasoning scaffolding is a valuable design choice
- For dataset creators: The Human-LLM collaborative annotation pipeline offers a reusable and scalable template for building fine-grained step-level reasoning datasets in other specialized domains beyond law
Abstract
Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.