PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization
Problem Statement
RAG-based LLMs are vulnerable to adversarial attacks, but existing attack methods fail in low-poisoning-budget scenarios, are easily detected by anomaly detection systems, and rely on heuristic rather than principled optimization approaches. These limitations reduce their value for realistic threat modeling and leave a gap in understanding the true security risks of RAG systems deployed in critical applications such as medical QA.
Key Novelty
- Coordinated dual-vector attack combining prompt-level backdoor triggers with RAG knowledge-base poisoning, rather than attacking either component in isolation
- Formal bilevel optimization framework for generating optimal poisoned texts and triggers, providing theoretical grounding absent in prior heuristic-based attacks
- Achieves high attack success rate under a strictly limited poisoned-text budget while maintaining stealth against anomaly detection systems
Evaluation Highlights
- High attack success rate achieved even with a small number of poisoned texts injected into the knowledge database, outperforming existing RAG attack baselines
- Significantly improved stealth compared to existing methods, meaning the poisoned texts are less likely to be flagged by anomaly detection systems while remaining effective
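The headline metric above, attack success rate (ASR), can be sketched as follows. This is a minimal, assumed implementation (substring matching against the attacker's target answer is a hypothetical rule, not necessarily the paper's exact evaluation protocol):

```python
def attack_success_rate(responses, target_answers):
    """Fraction of targeted queries whose LLM response contains the
    attacker's pre-designed answer (hypothetical matching rule:
    case-insensitive substring containment)."""
    hits = sum(
        1
        for resp, tgt in zip(responses, target_answers)
        if tgt.lower() in resp.lower()
    )
    return hits / len(responses)

# Illustrative data: 2 of 3 targeted queries yield the attacker's answer.
asr = attack_success_rate(
    ["Aspirin is unsafe.", "Take drug X daily.", "Consult a doctor."],
    ["unsafe", "drug X", "drug X"],
)
```

In practice the matching rule matters: exact-match scoring understates ASR when the model paraphrases the target answer, while substring matching can overstate it.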
Methodology
- Formulate the attack as a bilevel optimization problem: the outer level optimizes the backdoor trigger embedded in the prompt to maximize the probability of activating malicious behavior, while the inner level optimizes poisoned texts injected into the RAG knowledge database to be retrieved in response to targeted queries
- Solve the bilevel problem iteratively, alternating between updating the trigger (outer problem) and updating the poisoned documents (inner problem), and leverage gradient-based optimization to find adversarially effective and stealthy solutions
- Evaluate the optimized attack across diverse LLMs and datasets, measuring attack success rate (ASR) under limited poisoning budgets and stealth against anomaly detection, comparing against existing RAG attack baselines
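The alternating scheme above can be sketched on a toy problem. This is an illustrative sketch only: the quadratic losses, embedding dimensions, and learning rate are assumptions standing in for the paper's actual attack objectives, and real triggers/poisoned texts are discrete token sequences rather than free vectors:

```python
import numpy as np

def alternating_bilevel(steps=200, lr=0.1):
    """Toy alternating solver for a bilevel attack objective.
    Inner step: pull the poisoned-text embedding toward the targeted
    query so the retriever returns it. Outer step: align the prompt
    trigger with the current poison so it reliably activates the
    malicious content. Both losses are illustrative quadratics."""
    rng = np.random.default_rng(0)
    trigger = rng.normal(size=4)  # outer variable (prompt trigger)
    poison = rng.normal(size=4)   # inner variable (poisoned text)
    query = np.ones(4)            # embedding of the targeted query

    for _ in range(steps):
        # Inner problem: gradient step on ||poison - query||^2.
        poison = poison - lr * 2 * (poison - query)
        # Outer problem: gradient step on ||trigger - poison||^2,
        # using the freshly updated inner solution.
        trigger = trigger - lr * 2 * (trigger - poison)
    return trigger, poison

trigger, poison = alternating_bilevel()
```

The key structural point, mirrored from the paper's formulation, is that each outer (trigger) update is taken against the current solution of the inner (poisoning) problem rather than optimizing the two in isolation.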
System Components
- Backdoor trigger: A small adversarial perturbation or token sequence embedded in the user prompt that, when present, activates malicious behavior in the LLM, while the model behaves normally on clean prompts
- Poisoned texts: A minimal set of adversarially crafted documents injected into the RAG retrieval corpus, designed to be retrieved for targeted queries and guide the LLM toward pre-designed malicious outputs
- Bilevel optimization framework: A two-level optimization formulation where the outer loop optimizes the trigger and the inner loop optimizes poisoned texts, enabling joint, principled co-optimization of both attack components
- Stealth constraints: Constraints or regularization within the optimization that ensure poisoned texts appear benign to anomaly detection systems, reducing detectability compared to heuristic attacks
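The retrieval-targeting behavior of the poisoned documents can be illustrated with a dense retriever stand-in. The embeddings below are hypothetical toy vectors; the point is that a crafted document can dominate cosine-similarity ranking for the targeted query while staying irrelevant to benign queries:

```python
import numpy as np

def top1(query_vec, doc_vecs):
    """Return the index of the document with the highest cosine
    similarity to the query (a stand-in for dense retrieval)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return int(np.argmax(d @ q))

# Hypothetical 2-D embeddings: the poisoned doc (index 2) is crafted
# to sit near the targeted query's direction.
docs = np.array([
    [1.0, 0.0],   # benign doc A
    [0.0, 1.0],   # benign doc B
    [0.9, 0.9],   # poisoned doc, aligned with the targeted query
])
targeted_query = np.array([1.0, 1.0])
benign_query = np.array([0.0, 1.0])
```

Here `top1(targeted_query, docs)` returns the poisoned document, while `top1(benign_query, docs)` still returns a benign one, mirroring the conditional activation the components above describe.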
Results
| Metric/Benchmark | Baseline (Existing Attacks) | PR-Attack | Delta |
|---|---|---|---|
| Attack Success Rate (low poison budget) | Low / degraded | High | Significant improvement |
| Stealth (anomaly detection bypass) | Often detectable | Significantly improved stealth | Meaningful reduction in detection rate |
| Optimization approach | Heuristic, no guarantees | Principled bilevel optimization | Formal framework with theoretical grounding |
| Datasets/LLMs coverage | Limited | Diverse LLMs and datasets | Broader generalization demonstrated |
Key Takeaways
- RAG systems face a compounded threat when both the retrieval corpus and the prompt interface are simultaneously attacked — defenders should not treat these as independent attack surfaces
- Bilevel optimization is a powerful framework for adversarial attack design in compound AI systems (e.g., LLM + retrieval), and similar formulations may apply to other agentic or tool-augmented LLM pipelines
- Security audits of RAG deployments in high-stakes domains (medical QA, legal, finance) must account for low-budget, stealthy coordinated attacks, not just bulk data poisoning, necessitating more robust anomaly detection and retrieval filtering strategies
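As a concrete starting point for the defensive side of the last takeaway, a minimal anomaly filter can be sketched as a score-outlier check. This is an assumed defense for illustration, not from the paper, and it is exactly the kind of crude detector that a stealth-optimized, low-budget attack like PR-Attack is designed to evade:

```python
import statistics

def flag_outliers(scores, z_thresh=2.0):
    """Flag retrieval candidates whose score deviates strongly from
    the corpus mean (z-score test) -- a crude proxy for the anomaly
    detection systems discussed above."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return []
    return [i for i, s in enumerate(scores) if abs(s - mean) / sd > z_thresh]

# Eight benign retrieval scores plus one naively poisoned document
# whose score is conspicuously high.
scores = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.51, 0.49, 0.95]
flagged = flag_outliers(scores)
```

A naive poisoning attempt stands out at index 8, but an optimization-driven attack can constrain its poisoned texts to score within the benign distribution, which is why the takeaways above call for more robust filtering than single-statistic outlier tests.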
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of applications, e.g., medical question-answering, mathematical sciences, and code generation. However, they also exhibit inherent limitations, such as outdated knowledge and susceptibility to hallucinations. Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to address these issues, but it also introduces new vulnerabilities. Recent efforts have focused on the security of RAG-based LLMs, yet existing attack methods face three critical challenges: (1) their effectiveness declines sharply when only a limited number of poisoned texts can be injected into the knowledge database, (2) they lack sufficient stealth, as the attacks are often detectable by anomaly detection systems, which compromises their effectiveness, and (3) they rely on heuristic approaches to generate poisoned texts, lacking formal optimization frameworks and theoretical guarantees, which limits their effectiveness and applicability. To address these issues, we propose the coordinated Prompt-RAG Attack (PR-Attack), a novel optimization-driven attack that introduces a small number of poisoned texts into the knowledge database while embedding a backdoor trigger within the prompt. When activated, the trigger causes the LLM to generate pre-designed responses to targeted queries, while maintaining normal behavior in other contexts. This ensures both high effectiveness and stealth. We formulate the attack generation process as a bilevel optimization problem, leveraging a principled optimization framework to develop optimal poisoned texts and triggers. Extensive experiments across diverse LLMs and datasets demonstrate the effectiveness of PR-Attack, achieving a high attack success rate even with a limited number of poisoned texts and significantly improved stealth compared to existing methods.
These results highlight the potential risks posed by PR-Attack and emphasize the importance of securing RAG-based LLMs against such threats.