PR-Attack: Coordinated Prompt-RAG Attacks on Retrieval-Augmented Generation in Large Language Models via Bilevel Optimization
Problem Statement
RAG-based LLMs are vulnerable to adversarial attacks, but existing attack methods fail in low-poisoning-budget scenarios, are easily detected by anomaly detection systems, and rely on heuristic rather than principled optimization approaches. These limitations reduce their value for realistic threat modeling and leave a gap in understanding the true security risks of RAG systems deployed in critical applications such as medical QA.
Key Novelty
- Coordinated dual-vector attack combining prompt-level backdoor triggers with RAG knowledge-base poisoning, rather than attacking either component in isolation
- Formal bilevel optimization framework for generating optimal poisoned texts and triggers, providing theoretical grounding absent in prior heuristic-based attacks
- Achieves high attack success rate under a strictly limited poisoned-text budget while maintaining stealth against anomaly detection systems
Evaluation Highlights
- High attack success rate achieved even with a small number of poisoned texts injected into the knowledge database, outperforming existing RAG attack baselines
- Significantly improved stealth compared to existing methods, meaning the poisoned texts are less likely to be flagged by anomaly detection systems while remaining effective
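The headline metric above, attack success rate (ASR), can be sketched as follows. This is a minimal, assumed implementation (substring matching against the attacker's target answer is a hypothetical rule, not necessarily the paper's exact evaluation protocol):

```python
def attack_success_rate(responses, target_answers):
    """Fraction of targeted queries whose LLM response contains the
    attacker's pre-designed answer (hypothetical matching rule:
    case-insensitive substring containment)."""
    hits = sum(
        1
        for resp, tgt in zip(responses, target_answers)
        if tgt.lower() in resp.lower()
    )
    return hits / len(responses)

# Illustrative data: 2 of 3 targeted queries yield the attacker's answer.
asr = attack_success_rate(
    ["Aspirin is unsafe.", "Take drug X daily.", "Consult a doctor."],
    ["unsafe", "drug X", "drug X"],
)
```

In practice the matching rule matters: exact-match scoring understates ASR when the model paraphrases the target answer, while substring matching can overstate it.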
Methodology
- Formulate the attack as a bilevel optimization problem: the outer level optimizes the backdoor trigger embedded in the prompt to maximize the probability of activating malicious behavior, while the inner level optimizes poisoned texts injected into the RAG knowledge database to be retrieved in response to targeted queries
- Solve the bilevel problem iteratively, alternating between updating the trigger (outer problem) and updating the poisoned documents (inner problem), and leverage gradient-based optimization to find adversarially effective and stealthy solutions
- Evaluate the optimized attack across diverse LLMs and datasets, measuring attack success rate (ASR) under limited poisoning budgets and stealth against anomaly detection, comparing against existing RAG attack baselines
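The alternating scheme above can be sketched on a toy problem. This is an illustrative sketch only: the quadratic losses, embedding dimensions, and learning rate are assumptions standing in for the paper's actual attack objectives, and real triggers/poisoned texts are discrete token sequences rather than free vectors:

```python
import numpy as np

def alternating_bilevel(steps=200, lr=0.1):
    """Toy alternating solver for a bilevel attack objective.
    Inner step: pull the poisoned-text embedding toward the targeted
    query so the retriever returns it. Outer step: align the prompt
    trigger with the current poison so it reliably activates the
    malicious content. Both losses are illustrative quadratics."""
    rng = np.random.default_rng(0)
    trigger = rng.normal(size=4)  # outer variable (prompt trigger)
    poison = rng.normal(size=4)   # inner variable (poisoned text)
    query = np.ones(4)            # embedding of the targeted query

    for _ in range(steps):
        # Inner problem: gradient step on ||poison - query||^2.
        poison = poison - lr * 2 * (poison - query)
        # Outer problem: gradient step on ||trigger - poison||^2,
        # using the freshly updated inner solution.
        trigger = trigger - lr * 2 * (trigger - poison)
    return trigger, poison

trigger, poison = alternating_bilevel()
```

The key structural point, mirrored from the paper's formulation, is that each outer (trigger) update is taken against the current solution of the inner (poisoning) problem rather than optimizing the two in isolation.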
System Components
- Backdoor trigger: A small adversarial perturbation or token sequence embedded in the user prompt that, when present, activates malicious behavior in the LLM, while the model behaves normally on clean prompts
- Poisoned texts: A minimal set of adversarially crafted documents injected into the RAG retrieval corpus, designed to be retrieved for targeted queries and guide the LLM toward pre-designed malicious outputs
- Bilevel optimization framework: A two-level optimization formulation where the outer loop optimizes the trigger and the inner loop optimizes poisoned texts, enabling joint, principled co-optimization of both attack components
- Stealth constraints: Constraints or regularization within the optimization that ensure poisoned texts appear benign to anomaly detection systems, reducing detectability compared to heuristic attacks
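The retrieval-targeting behavior of the poisoned documents can be illustrated with a dense retriever stand-in. The embeddings below are hypothetical toy vectors; the point is that a crafted document can dominate cosine-similarity ranking for the targeted query while staying irrelevant to benign queries:

```python
import numpy as np

def top1(query_vec, doc_vecs):
    """Return the index of the document with the highest cosine
    similarity to the query (a stand-in for dense retrieval)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return int(np.argmax(d @ q))

# Hypothetical 2-D embeddings: the poisoned doc (index 2) is crafted
# to sit near the targeted query's direction.
docs = np.array([
    [1.0, 0.0],   # benign doc A
    [0.0, 1.0],   # benign doc B
    [0.9, 0.9],   # poisoned doc, aligned with the targeted query
])
targeted_query = np.array([1.0, 1.0])
benign_query = np.array([0.0, 1.0])
```

Here `top1(targeted_query, docs)` returns the poisoned document, while `top1(benign_query, docs)` still returns a benign one, mirroring the conditional activation the components above describe.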
Results
| Metric/Benchmark | Baseline (Existing Attacks) | PR-Attack | Delta |
|---|---|---|---|
| Attack Success Rate (low poison budget) | Low / degraded | High | Significant improvement |
| Stealth (anomaly detection bypass) | Often detectable | Significantly improved stealth | Meaningful reduction in detection rate |
| Optimization approach | Heuristic, no guarantees | Principled bilevel optimization | Formal framework with theoretical grounding |
| Datasets/LLMs coverage | Limited | Diverse LLMs and datasets | Broader generalization demonstrated |
Key Takeaways
- RAG systems face a compounded threat when both the retrieval corpus and the prompt interface are simultaneously attacked — defenders should not treat these as independent attack surfaces
- Bilevel optimization is a powerful framework for adversarial attack design in compound AI systems (e.g., LLM + retrieval), and similar formulations may apply to other agentic or tool-augmented LLM pipelines
- Security audits of RAG deployments in high-stakes domains (medical QA, legal, finance) must account for low-budget, stealthy coordinated attacks, not just bulk data poisoning, necessitating more robust anomaly detection and retrieval filtering strategies
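As a concrete starting point for the defensive side of the last takeaway, a minimal anomaly filter can be sketched as a score-outlier check. This is an assumed defense for illustration, not from the paper, and it is exactly the kind of crude detector that a stealth-optimized, low-budget attack like PR-Attack is designed to evade:

```python
import statistics

def flag_outliers(scores, z_thresh=2.0):
    """Flag retrieval candidates whose score deviates strongly from
    the corpus mean (z-score test) -- a crude proxy for the anomaly
    detection systems discussed above."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return []
    return [i for i, s in enumerate(scores) if abs(s - mean) / sd > z_thresh]

# Eight benign retrieval scores plus one naively poisoned document
# whose score is conspicuously high.
scores = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.51, 0.49, 0.95]
flagged = flag_outliers(scores)
```

A naive poisoning attempt stands out at index 8, but an optimization-driven attack can constrain its poisoned texts to score within the benign distribution, which is why the takeaways above call for more robust filtering than single-statistic outlier tests.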
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of applications, e.g., medical question-answering, mathematical sciences, and code generation. However, they also exhibit inherent limitations, such as outdated knowledge and susceptibility to hallucinations. Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to address these issues, but it also introduces new vulnerabilities. Recent efforts have focused on the security of RAG-based LLMs, yet existing attack methods face three critical challenges: (1) their effectiveness declines sharply when only a limited number of poisoned texts can be injected into the knowledge database, (2) they lack sufficient stealth, as the attacks are often detectable by anomaly detection systems, which compromises their effectiveness, and (3) they rely on heuristic approaches to generate poisoned texts, lacking formal optimization frameworks and theoretical guarantees, which limits their effectiveness and applicability. To address these issues, we propose the coordinated Prompt-RAG Attack (PR-Attack), a novel optimization-driven attack that introduces a small number of poisoned texts into the knowledge database while embedding a backdoor trigger within the prompt. When activated, the trigger causes the LLM to generate pre-designed responses to targeted queries, while maintaining normal behavior in other contexts. This ensures both high effectiveness and stealth. We formulate the attack generation process as a bilevel optimization problem, leveraging a principled optimization framework to develop optimal poisoned texts and triggers. Extensive experiments across diverse LLMs and datasets demonstrate the effectiveness of PR-Attack, achieving a high attack success rate even with a limited number of poisoned texts and significantly improved stealth compared to existing methods.
These results highlight the potential risks posed by PR-Attack and emphasize the importance of securing RAG-based LLMs against such threats.