
Adversarial Attacks against Neural Ranking Models via In-Context Learning

Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri, Charles L. A. Clarke
SIGIR-AP | 2025
FSAP (Few-Shot Adversarial Prompting) is a black-box framework that uses LLM in-context learning to generate fluent, topically coherent adversarial documents that consistently outrank credible content in neural ranking models without requiring gradient access.

Problem Statement

Neural ranking models (NRMs) powering search engines are vulnerable to adversarial manipulation, posing risks such as health misinformation being ranked above credible sources. Prior attack methods rely on token-level perturbations or manual rewriting, requiring white-box access or significant human effort. FSAP addresses this gap by enabling scalable, realistic black-box attacks using only few-shot prompting, with no internal model access.

Key Novelty

  • First framework to formulate adversarial attacks on neural ranking models entirely through few-shot prompting, eliminating the need for gradient access or model instrumentation
  • Dual-mode design: FSAPIntraQ for query-specific attacks leveraging same-query harmful examples, and FSAPInterQ for cross-query generalization by transferring adversarial patterns
  • Demonstrates that LLM-generated adversarial documents are both hard to detect and exhibit strong stance alignment, making them a realistic and scalable threat

Evaluation Highlights

  • FSAP-generated adversarial documents consistently outranked credible, factually accurate documents across four diverse neural ranking models on TREC 2020 and 2021 Health Misinformation Tracks
  • Adversarial outputs showed strong stance alignment with misinformation targets and low detectability, and the approach generalized effectively across both proprietary and open-source LLMs

Breakthrough Assessment

6/10. FSAP is a solid and practically significant contribution that lowers the barrier for adversarial attacks on IR systems by removing gradient requirements, but it is more of a novel application of existing ICL capabilities to a known vulnerability class than a fundamental methodological breakthrough.

Methodology

  1. Collect a small support set of previously observed adversarial or harmful documents relevant to target queries to serve as few-shot demonstrations
  2. Condition an LLM with these few-shot examples to synthesize new grammatically fluent, topically coherent documents that embed false or misleading content aligned with the adversarial goal
  3. Inject generated documents into the retrieval corpus and evaluate ranking position against credible documents across multiple NRMs to assess attack effectiveness, stance alignment, and detectability
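The three steps above amount to assembling a few-shot prompt from harmful support documents and asking an LLM to continue the pattern. The sketch below shows one plausible prompt constructor; the function name, instruction wording, and example layout are illustrative assumptions, not the paper's actual prompt template.

```python
def build_fsap_prompt(query: str, support_docs: list[str], stance: str) -> str:
    """Assemble a few-shot adversarial prompt (step 2 of the methodology).

    Conditions the LLM on previously observed harmful documents so it
    imitates their style and stance for a new query. The instruction text
    here is hypothetical.
    """
    parts = [
        "Below are example documents that take a particular stance on a query.",
        "Write a new, grammatically fluent, topically coherent document "
        "in the same style and with the same stance.",
        f"Query: {query}",
        f"Target stance: {stance}",
    ]
    # Each support document becomes one in-context demonstration.
    for i, doc in enumerate(support_docs, start=1):
        parts.append(f"Example {i}:\n{doc}")
    parts.append("New document:")
    return "\n\n".join(parts)


# Usage: the resulting string would be sent to any black-box LLM API.
prompt = build_fsap_prompt(
    query="does vitamin C cure the common cold",
    support_docs=["harmful support doc A ...", "harmful support doc B ..."],
    stance="supportive of the false claim",
)
```

Because the attack lives entirely in this prompt string, no gradients, logits, or ranker internals are ever needed, which is what makes the method black-box.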

System Components

FSAPIntraQ

Attack mode that sources few-shot examples from harmful documents associated with the same query, maximizing topic fidelity and relevance for targeted attacks

FSAPInterQ

Attack mode that transfers adversarial patterns from examples across unrelated queries, enabling broader generalization and scalable attacks without query-specific data
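The only difference between the two modes is where the few-shot support set comes from. A minimal sketch of that selection step, assuming a pool of `(query_id, doc_text)` pairs (a format of my choosing, not the paper's):

```python
def select_support_set(pool, target_qid, mode, k=3):
    """Pick k few-shot examples from a pool of (query_id, doc_text) pairs.

    mode="intra": harmful examples from the same query (FSAPIntraQ),
                  maximizing topic fidelity.
    mode="inter": examples drawn from other queries (FSAPInterQ),
                  transferring adversarial patterns across topics.
    """
    if mode == "intra":
        candidates = [doc for qid, doc in pool if qid == target_qid]
    elif mode == "inter":
        candidates = [doc for qid, doc in pool if qid != target_qid]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return candidates[:k]


pool = [("q1", "doc a"), ("q2", "doc b"), ("q1", "doc c"), ("q3", "doc d")]
same_query = select_support_set(pool, "q1", "intra")   # ["doc a", "doc c"]
cross_query = select_support_set(pool, "q1", "inter")  # ["doc b", "doc d"]
```

The inter-query variant matters because it removes the need for any query-specific harmful data, which is what makes broad, low-cost campaigns feasible.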

Few-Shot Adversarial Prompt Constructor

Core ICL mechanism that conditions the LLM on harmful support examples to generate new adversarial documents without requiring gradient access or white-box model information

Stance Alignment Evaluator

Analysis component that measures whether generated adversarial documents consistently convey the intended misleading stance
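One simple way to quantify this is the fraction of generated documents whose predicted stance matches the intended misleading stance. The sketch below assumes stance labels come from some external judge (an LLM or a trained classifier, left abstract here); the metric shape is my assumption, not necessarily the paper's exact formulation.

```python
def stance_alignment_rate(predicted_stances, target_stance):
    """Fraction of generated documents whose predicted stance matches
    the intended (misleading) target stance.

    predicted_stances: labels produced by an external stance judge.
    """
    if not predicted_stances:
        return 0.0
    matches = sum(1 for s in predicted_stances if s == target_stance)
    return matches / len(predicted_stances)


# e.g. two of three generated documents carry the intended stance.
rate = stance_alignment_rate(["supportive", "supportive", "opposing"],
                             "supportive")
```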

Results

  • Ranking vs. credible docs (TREC 2020 Health Misinformation Track) — baseline: credible documents rank above adversarial content; FSAP: adversarial documents consistently outrank credible documents (significant rank promotion)
  • Ranking vs. credible docs (TREC 2021 Health Misinformation Track) — baseline: credible documents rank above adversarial content; FSAP: adversarial documents consistently outrank credible documents (significant rank promotion)
  • Detectability of adversarial content — baseline: high detectability for prior token-level attacks; FSAP: low detectability (improved stealth)
  • Generalization across LLMs — baseline: prior methods target a single model; FSAP: effective across proprietary and open-source LLMs (broad transferability)
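The "rank promotion" claim reduces to checking where the injected document lands once an NRM scores the corpus. A minimal sketch, using made-up relevance scores purely for illustration:

```python
def rank_of(doc_id, scores):
    """1-based rank of doc_id when documents are sorted by descending
    NRM relevance score (ties broken by sort order)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(doc_id) + 1


# Hypothetical scores an NRM might assign after corpus injection.
scores = {"credible": 0.71, "adversarial": 0.84, "other": 0.40}
promoted = rank_of("adversarial", scores) < rank_of("credible", scores)
```

A successful attack, in the paper's terms, is exactly the case where `promoted` holds across queries and across the four evaluated rankers.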

Key Takeaways

  • Neural ranking models deployed in high-stakes domains (e.g., health search) are vulnerable to scalable black-box attacks using off-the-shelf LLMs, requiring no privileged model access — defenders must test robustness against prompt-based adversarial content
  • Cross-query adversarial transfer (FSAPInterQ) means attackers don't need query-specific data, making it feasible to launch broad campaigns against retrieval systems at low cost
  • Low detectability of FSAP outputs suggests that standard content moderation and adversarial detection pipelines may be insufficient; retrieval system audits should incorporate fluency-aware and stance-aware adversarial testing

Abstract

While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAPIntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAPInterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.

Generated on 2026-03-03 using Claude