CORE-KG: An LLM-Driven Knowledge Graph Construction Framework for Human Smuggling Networks

D. Meher, Carlotta Domeniconi, Guadalupe Correa-Cabrera
arXiv.org | 2025
CORE-KG is a modular LLM-driven framework that constructs cleaner, more coherent knowledge graphs from unstructured legal texts about human smuggling networks by combining type-aware coreference resolution with domain-guided entity and relationship extraction.

Problem Statement

Human smuggling case documents are unstructured, lexically dense, and contain ambiguous or shifting entity references, making automated knowledge graph construction unreliable. Existing KG methods rely on static templates and lack coreference resolution, while LLM-based approaches such as GraphRAG produce noisy, fragmented graphs with hallucinated content and duplicate nodes. This limits investigators' ability to systematically analyze complex criminal network structures from legal records.

Key Novelty

  • Type-aware coreference resolution using sequential, structured LLM prompts to resolve ambiguous and shifting entity references before extraction
  • Domain-guided entity and relationship extraction built on an adapted GraphRAG framework with specialized instructions tailored to criminal/legal text domains
  • A modular two-step pipeline architecture that explicitly decouples coreference resolution from extraction, reducing node duplication and legal noise in the resulting knowledge graph

Evaluation Highlights

  • 33.28% reduction in node duplication compared to the GraphRAG-based baseline, indicating fewer fragmented or redundant entity representations
  • 38.37% reduction in legal noise compared to the GraphRAG-based baseline, resulting in cleaner and more semantically coherent graph structures

Breakthrough Assessment

5/10. CORE-KG makes a solid, domain-specific contribution by addressing real limitations in LLM-based KG construction (coreference and noise), but the approach is largely a well-engineered combination of existing techniques (GraphRAG + structured prompting) applied to a specialized domain rather than a fundamentally new algorithmic advance.

Methodology

  1. Step 1 - Coreference Resolution: Apply type-aware coreference resolution using sequential, structured LLM prompts to identify and consolidate ambiguous or co-referring entity mentions across legal documents
  2. Step 2 - Domain-Guided Extraction: Use domain-specific instructions within an adapted GraphRAG framework to extract entities and relationships, leveraging the resolved coreferences to reduce duplication and noise
  3. Step 3 - Knowledge Graph Assembly: Compile extracted entities and relationships into an interpretable knowledge graph structure suitable for downstream analysis of criminal network topology
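
The assembly step listed last above can be illustrated with a minimal sketch. It assumes the extraction step yields (subject, relation, object) triples and that a plain adjacency list is an acceptable graph representation; the names `assemble_graph` and `degree` are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object), an assumed format
Graph = Dict[str, List[Tuple[str, str]]]


def assemble_graph(triples: Iterable[Triple]) -> Graph:
    """Build a directed adjacency list: node -> [(relation, neighbor), ...]."""
    graph: Graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
        graph.setdefault(obj, [])  # register nodes that only appear as objects
    return graph


def degree(graph: Graph) -> Dict[str, int]:
    """Count outgoing plus incoming edges for every node."""
    counts = {node: len(edges) for node, edges in graph.items()}
    for edges in graph.values():
        for _, neighbor in edges:
            counts[neighbor] += 1
    return counts
```

A degree count like this is one simple way to look at network topology; the paper does not state which downstream analyses it has in mind.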

System Components

Type-Aware Coreference Resolver

Uses sequential, structured LLM prompts to detect and resolve co-referring mentions of entities (e.g., aliases, pronouns, role-based references) across legal documents, categorized by entity type
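
As a rough illustration of sequential, type-aware prompting, the sketch below runs one structured prompt per entity type and feeds each rewritten text into the next prompt. The entity types, the prompt wording, and the `llm` callable are all assumptions made for the example; the paper's actual prompts and type inventory may differ.

```python
from typing import Callable, Sequence

# Hypothetical entity types, chosen only for illustration.
ENTITY_TYPES: Sequence[str] = ("PERSON", "LOCATION", "ORGANIZATION")

# Hypothetical prompt wording, not the paper's prompt.
PROMPT_TEMPLATE = (
    "Resolve coreferences for entities of type {entity_type} only.\n"
    "Replace every alias, pronoun, or role-based reference with one canonical name.\n"
    "Return the full rewritten text and nothing else.\n\n"
    "Text:\n{text}"
)


def resolve_coreferences(
    text: str,
    llm: Callable[[str], str],
    entity_types: Sequence[str] = ENTITY_TYPES,
) -> str:
    """Run one prompt per entity type, passing each output on to the next prompt."""
    resolved = text
    for entity_type in entity_types:
        prompt = PROMPT_TEMPLATE.format(entity_type=entity_type, text=resolved)
        resolved = llm(prompt)  # `llm` is any function mapping a prompt to a completion
    return resolved
```

Passing the model in as a callable keeps the sketch independent of any particular LLM client and makes it easy to substitute a stub when testing.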

Domain-Guided Extractor

An adapted GraphRAG module that uses domain-specific prompts and instructions to extract entities and relationships relevant to human smuggling networks, reducing hallucinations and irrelevant legal boilerplate
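
One way to picture domain-guided extraction is a prompt that names the allowed entity types and tells the model to skip legal boilerplate, followed by a filter on the model's answer. The sketch below assumes a JSON answer format, a small hypothetical type list, and hypothetical function names; it is not GraphRAG's own prompt or output format, nor the paper's.

```python
import json
from typing import Dict, List, Set

# Hypothetical domain schema, for illustration only.
ALLOWED_TYPES: Set[str] = {"PERSON", "LOCATION", "ORGANIZATION", "VEHICLE"}

# Hypothetical instruction text, not the paper's prompt.
INSTRUCTIONS = (
    "Extract entities and relationships relevant to the smuggling operation.\n"
    "Allowed entity types: {types}.\n"
    "Ignore procedural legal boilerplate (courts, statutes, counsel, filings).\n"
    "Answer with a JSON object with keys 'entities' and 'relationships'."
)


def build_extraction_prompt(text: str) -> str:
    """Combine the domain instructions with the coreference-resolved text."""
    types = ", ".join(sorted(ALLOWED_TYPES))
    return INSTRUCTIONS.format(types=types) + "\n\nText:\n" + text


def parse_extraction(raw: str) -> Dict[str, List[dict]]:
    """Keep entities of allowed types and relationships whose endpoints were kept."""
    data = json.loads(raw)
    entities = [
        e for e in data.get("entities", [])
        if e.get("type") in ALLOWED_TYPES and "name" in e
    ]
    names = {e["name"] for e in entities}
    relationships = [
        r for r in data.get("relationships", [])
        if r.get("source") in names and r.get("target") in names
    ]
    return {"entities": entities, "relationships": relationships}
```

Filtering after the model responds is a second line of defense: even if the model ignores the instruction and emits a court or statute as an entity, it does not reach the graph.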

GraphRAG Adaptation Layer

Modified GraphRAG pipeline that integrates coreference outputs and domain constraints to produce structured, deduplicated knowledge graph triples from raw legal text
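
To make the idea of deduplicated triples concrete, the sketch below maps mentions to canonical names using an alias table (standing in for the coreference output) and drops repeated triples. The alias-table format, the normalization rule, and the function names are assumptions for illustration, not details taken from the paper or from GraphRAG.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object), an assumed format


def clean(name: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(name.split())


def normalize(name: str) -> str:
    """Case-fold a cleaned name so trivially different surface forms compare equal."""
    return clean(name).casefold()


def dedupe_triples(triples: Iterable[Triple], aliases: Dict[str, str]) -> List[Triple]:
    """Replace aliased mentions with canonical names and drop repeats, keeping order."""
    canonical = {normalize(alias): name for alias, name in aliases.items()}
    seen = set()
    result: List[Triple] = []
    for subj, rel, obj in triples:
        subj = canonical.get(normalize(subj), clean(subj))
        obj = canonical.get(normalize(obj), clean(obj))
        key = (normalize(subj), rel.strip().casefold(), normalize(obj))
        if key not in seen:  # first surface form wins
            seen.add(key)
            result.append((subj, rel.strip(), obj))
    return result
```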

CORE-KG Pipeline Orchestrator

Modular two-step framework controller that sequences coreference resolution before extraction, ensuring clean inputs at each stage and enabling independent component improvement
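
A minimal sketch of such a controller is shown below, assuming each stage is an ordinary callable so that either one can be replaced without touching the other. The `Pipeline` class and its field names are hypothetical and only illustrate the ordering constraint.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object), an assumed format


@dataclass
class Pipeline:
    """Runs coreference resolution first, then extraction, on each document."""

    resolve: Callable[[str], str]           # stage 1: raw text -> resolved text
    extract: Callable[[str], List[Triple]]  # stage 2: resolved text -> triples

    def run(self, documents: List[str]) -> List[Triple]:
        triples: List[Triple] = []
        for doc in documents:
            resolved = self.resolve(doc)  # extraction never sees unresolved text
            triples.extend(self.extract(resolved))
        return triples
```

Because the stages only share a string interface, a better resolver or extractor can be dropped in and evaluated in isolation, which is the kind of independent component improvement the description refers to.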

Results

Metric                | GraphRAG Baseline       | CORE-KG                   | Delta
Node Duplication Rate | Baseline level          | 33.28% lower              | -33.28%
Legal Noise in Graph  | Baseline level          | 38.37% lower              | -38.37%
Graph Coherence       | Fragmented/noisy        | Cleaner, more coherent    | Qualitative improvement
Coreference Handling  | None / static templates | Type-aware LLM resolution | New capability
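
The summary gives only the percentage deltas, not the underlying rates. Assuming they are relative reductions against the baseline (an interpretation, not something stated here), the arithmetic would look like the sketch below; the function name is hypothetical.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0
```

Under this reading, a 33.28% reduction means the CORE-KG duplication rate is about two thirds of the baseline rate, whatever the baseline value is.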

Key Takeaways

  • Decoupling coreference resolution from extraction in LLM-based KG pipelines is a practical strategy to reduce node duplication and improve graph quality, especially for domains with aliased or role-shifting entities
  • Domain-guided prompting in GraphRAG-style systems reduces irrelevant extractions: the full CORE-KG pipeline reports 38.37% less legal noise than the GraphRAG-based baseline, which suggests that investing in domain-specific instruction design yields measurable graph quality improvements over generic LLM prompting
  • The CORE-KG modular architecture is transferable to other high-stakes NLP domains (e.g., financial crime, counter-terrorism) where legal or regulatory documents contain dense, ambiguous entity references requiring structured extraction pipelines

Abstract

Human smuggling networks are increasingly adaptive and difficult to analyze. Legal case documents offer valuable insights but are unstructured, lexically dense, and filled with ambiguous or shifting references, posing challenges for automated knowledge graph (KG) construction. Existing KG methods often rely on static templates and lack coreference resolution, while recent LLM-based approaches frequently produce noisy, fragmented graphs due to hallucinations, and duplicate nodes caused by a lack of guided extraction. We propose CORE-KG, a modular framework for building interpretable KGs from legal texts. It uses a two-step pipeline: (1) type-aware coreference resolution via sequential, structured LLM prompts, and (2) entity and relationship extraction using domain-guided instructions, built on an adapted GraphRAG framework. CORE-KG reduces node duplication by 33.28% and legal noise by 38.37% compared to a GraphRAG-based baseline, resulting in cleaner and more coherent graph structures. These improvements make CORE-KG a strong foundation for analyzing complex criminal networks.

Generated on 2026-03-03 using Claude