Dossier №01 · Project 19 · Ragprobe

Ragprobe: Adversarial Robustness for Clinical Retrieval-Augmented Generation

Corpus poisoning, prompt injection, indirect attacks, paraphrase brittleness — the security surface clinical RAG inherits from the RAG security literature, instantiated for the medical domain.

Chandra Takeoff AI · Healthcare AI Engineering Compiled · May 2026

Abstract Clinical RAG systems are now production infrastructure, but the security literature on RAG has documented attacks that are catastrophic when applied to medicine. PoisonedRAG[1] (Zou et al., USENIX Security 2025) shows that injecting only 5 poisoned texts per target question into a corpus of millions achieves ~90% attack success. BadRAG[2] (Xue et al. 2024) reports that 10 adversarial passages (0.04% of corpus) yield 98.2% retrieval success for the adversary. GARAG[3] (Cho et al., Findings of EMNLP 2024) achieves ~70% attack success via low-level typo perturbations alone. ConfusedPilot[4] (RoyChowdhury et al. 2024) demonstrates persistent cache-side attacks that survive document deletion. The medical domain inherits all of this and adds dedicated attack surfaces: Han et al.[9] (npj Digital Medicine 2024) injected incorrect biomedical facts by manipulating 1.1% of weights; Alber et al.[11] (Nature Medicine 2025) showed that replacing only 0.001% of training tokens with medical misinformation propagates harmful errors while passing standard QA benchmarks. Ragprobe consolidates these into a clinical-RAG adversarial benchmark of approximately 300 attack scenarios spanning corpus poisoning, indirect prompt injection, paraphrase brittleness, and clinically-targeted misinformation. Pass criterion: identification of ≥ 3 novel clinical-RAG failure modes in a frontier system; coordinated disclosure executed cleanly.

§ 1 Introduction

Clinical RAG is deployed in production at scale. Almanac (NEJM AI), MedRAG and its successors, and the wave of hospital-internal RAG systems are by now part of the clinical-information substrate. The security literature on RAG, meanwhile, has documented attacks that would be catastrophic in a medical setting: corpus-level poisoning, indirect prompt injection from retrieved content, low-level perturbation attacks, cache-persistence attacks. None of this literature has yet been consolidated into a clinical-specific benchmark.

Ragprobe is that benchmark. It is built from the published RAG-security literature with explicit clinical targeting — the harms are not just generic hallucination but specific dosing errors, contraindication omissions, misdiagnoses on populations the model wasn't trained for. Asclepius (Project 03) covers adversarial attacks on the model; Ragprobe covers adversarial attacks on the retrieval system itself.

1.1 Contributions

A clinical-RAG adversarial benchmark of approximately 300 scenarios across five attack categories.
A reproducible attack-and-evaluation harness implementing the PoisonedRAG[1], BadRAG[2], GARAG[3], Phantom[7], and indirect-prompt-injection[5] methodologies on clinical retrieval corpora.
Recommended mitigations grounded in RAAT[12] adversarial training and the existing medical-LLM data-poisoning defence literature[11].

§ 2 Background and Related Work

2.1 RAG Corpus Poisoning

PoisonedRAG (Zou et al., USENIX Security 2025)[1] establishes the headline: 5 poisoned texts per target question against a multi-million-document corpus reaches ~90% attack success. BadRAG (Xue et al. 2024)[2] sharpens further: 10 adversarial passages (0.04% of corpus) yield 98.2% retrieval success for the adversary. Corpus poisoning by Zhong et al. (EMNLP 2023)[6] establishes the foundational result that ≤ 500 adversarial passages fool dense retrievers on ≥ 50% of queries. Phantom (Chaudhari et al. 2024)[7] shows a single-document trigger attack transferring across Gemma, Vicuna, Llama, GPT-3.5, GPT-4, and NVIDIA Chat-with-RTX — a production system. The attack budget for catastrophic compromise is extremely small.

2.2 Indirect Prompt Injection

Greshake et al. (AISec 2023)[5] formalised the indirect-prompt-injection taxonomy and demonstrated remote exploitation of Bing Chat and GPT-4-integrated applications via web content the model retrieves. The threat surface for clinical RAG includes any document the system might retrieve — published guidelines, PubMed abstracts, hospital-internal notes. ConfusedPilot (RoyChowdhury et al. 2024)[4] extends to a "confused deputy" pattern with cache persistence: attacks survive after the malicious document is deleted from the indexed environment.

2.3 Low-Level Perturbation Attacks

GARAG (Cho et al., Findings of EMNLP 2024)[3] achieves ~70% attack success on NQ, TQA, and SQuAD via typo-style perturbations alone — no semantic manipulation required. This is operationally significant because typos in clinical notes are routine, and a deployed clinical RAG must be robust to them. TextAttack (Morris et al., EMNLP 2020)[8] provides the standardised attack framework — 16 literature attacks unified under a 4-component design — that Ragprobe uses as the adversarial-generation library.

2.4 Clinical-Targeted Attacks

Two recent results are specifically alarming. Han et al. (npj Digital Medicine 2024)[9] showed that manipulating only 1.1% of LLM weights injects incorrect biomedical facts while preserving general benchmark performance — validated on 1,025 false facts. Alber et al. (Nature Medicine 2025)[11] escalated to data-poisoning: replacing as little as 0.001% of training tokens with medical misinformation yields models that propagate harmful errors yet pass standard medical-QA benchmarks indistinguishably from clean models. The adversarial-hallucination Communications Medicine study[10] (also cited in Oracle, Project 05) reports up to 83% hallucination propagation on planted clinical facts across six frontier models. These results define why Ragprobe needs a clinical surface, not just generic RAG.

2.5 Defence Literature

RAAT (Fang et al., ACL 2024)[12] identifies three retrieval-noise classes — superficially-related, irrelevant, counterfactual — and proposes adaptive adversarial training that dynamically regulates training signal in response to noisy retrieved texts. This is the defence baseline Ragprobe recommends in its mitigations section, with the recognition that adversarial training is necessary but not sufficient: ConfusedPilot's[4] cache-side attacks bypass it entirely.

§ 3 Proposed Approach

3.1 Five Attack Categories

Figure 1 · Ragprobe attack surface

Figure 1. Ragprobe's five attack categories. Corpus poisoning implements PoisonedRAG[1], BadRAG[2], and Phantom[7] on clinical guideline and PubMed corpora. Indirect prompt injection follows Greshake et al.[5] via malicious content in retrieved documents. Low-level perturbations apply GARAG[3]-style typo attacks. Query paraphrasing tests retrieval brittleness to clinically-realistic rephrasings (extending NoLiMa methodology). Clinical misinformation plants false facts following Han et al.[9] and Alber et al.[11]. A sixth audit-only surface — ConfusedPilot[4] cache-persistence — is described and demonstrated but not productionised.

3.2 Targets and Disclosure

Initial targets: MedRAG / MIRAGE, Almanac architecture (re-implemented), Graphcore (Project 17), a vanilla vector RAG over the same corpus, and a long-context-only baseline. Coordinated disclosure follows the same protocol as Asclepius (Project 03): identified failure modes are reported to system maintainers and underlying model providers with a 90-day embargo before public release, with the dataset itself released gated post-disclosure.

§ 4 Evaluation Protocol

**Table 1.** Ragprobe attack-success metrics.
Surface	Attack success metric	Baseline target
Corpus poisoning	Adversary's claim returned by RAG response	Compare to PoisonedRAG ~90% baseline[1]
Indirect injection	RAG follows attacker instructions from retrieved doc	Per Greshake taxonomy[5]
Low-level perturbation	Answer changes / retrieval fails under typo budget	Compare to GARAG ~70%[3]
Paraphrase brittleness	Answer changes under semantically equivalent rephrasing	≤ 10% answer-change rate is the bar
Clinical misinformation	Planted false fact propagates to RAG response	Compare to Han 1.1% / Alber 0.001%[9][11]

Pass criterion Ragprobe v0.1 succeeds if it identifies ≥ 3 novel clinical-RAG failure modes in at least one tested system, executes coordinated disclosure cleanly, and quantifies attack-success rates across the five surfaces relative to published general-domain baselines.

§ 5 Expected Contributions

Benchmark. The first published clinical-RAG adversarial benchmark with coordinated-disclosure release.
Attack harness. Reproducible implementations of PoisonedRAG, BadRAG, GARAG, Phantom, and indirect-injection attacks targeted at clinical retrieval corpora.
Mitigation recommendations. Empirical comparison of RAAT-style adversarial training, retrieval filtering, and response-time defences against the five attack surfaces.

§ 6 Limitations and Risks

A published adversarial-RAG benchmark is dual-use. The mitigation is the same as Asclepius's: 90-day coordinated-disclosure window, gated dataset release, paraphrased rather than verbatim examples in the public manuscript, and explicit responsible-use framing. The clinical-misinformation surface in particular requires the most careful handling — Han et al.[9] and Alber et al.[11] faced the same dual-use question and chose to publish; Ragprobe inherits their decision and their disclosure norms.

A separate risk is benchmark gaming. If Ragprobe attack patterns are absorbed into adversarial-training corpora, the benchmark loses signal. The mitigation is a v0.2 held-out private surface, rotated quarterly, against which leaderboard entries are spot-checked — the same approach Caliper (Project 02) takes.

§ 7 Conclusion

Ragprobe is the security audit clinical RAG currently lacks. The component attacks are all published; the clinical instantiation is not. Combining the general-domain attack literature[1][2][3][5][7] with the medical-specific findings[9][10][11] into a single benchmark with coordinated disclosure is the contribution.

References

Zou W, Geng R, Wang B, Jia J. PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models. USENIX Security Symposium 2025 (arXiv:2402.07867, Feb 2024). 5 poisoned texts per target question achieves ~90% attack success against multi-million-document corpus. arxiv.org/abs/2402.07867
Xue J, Zheng M, Hu Y, Liu F, Chen X, Lou Q. BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation. arXiv:2406.00083, 2024. 10 adversarial passages (0.04% of corpus) yield 98.2% retrieval success. arxiv.org/abs/2406.00083
Cho S, Jeong S, Seo J, Hwang T, Park JC. Typos that Broke the RAG's Back: Genetic Attack on RAG Pipeline by Simulating Documents in the Wild via Low-level Perturbations (GARAG). Findings of EMNLP, 2024. ~70% attack success on NQ/TQA/SQuAD via typo-style perturbations alone. aclanthology.org/2024.findings-emnlp.161
RoyChowdhury A, Luo M, Sahu P, Banerjee S, Tiwari M. ConfusedPilot: Confused Deputy Risks in RAG-based LLMs. arXiv:2408.04870, 2024. Cache-persistent attacks survive document deletion from index. arxiv.org/abs/2408.04870
Greshake K, Abdelnabi S, Mishra S, Endres C, Holz T, Fritz M. Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. AISec @ CCS, 2023. Demonstrated remote exploitation of Bing Chat / GPT-4-integrated apps via retrieved web content. arxiv.org/abs/2302.12173
Zhong Z, Huang Z, Wettig A, Chen D. Poisoning Retrieval Corpora by Injecting Adversarial Passages. EMNLP, 2023. ≤ 500 adversarial passages fool dense retrievers on ≥ 50% of queries. arxiv.org/abs/2310.19156
Chaudhari H, Severi G, Abascal J, Suri A, Jagielski M, Choquette-Choo CA, Nasr M, Nita-Rotaru C, Oprea A. Phantom: General Trigger Attacks on Retrieval Augmented Language Generation. arXiv:2405.20485, 2024. Single-doc trigger transfers across Gemma, Vicuna, Llama, GPT-3.5, GPT-4, NVIDIA Chat-with-RTX. arxiv.org/abs/2405.20485
Morris JX, Lifland E, Yoo JY, Grigsby J, Jin D, Qi Y. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. EMNLP 2020 System Demonstrations. 16 literature attacks unified under a 4-component design. aclanthology.org/2020.emnlp-demos.16
Han T, Nebelung S, Khader F, et al. Medical large language models are susceptible to targeted misinformation attacks. npj Digital Medicine, Oct 2024. Manipulating 1.1% of LLM weights injects incorrect biomedical facts on 1,025 false facts while preserving general benchmark performance. nature.com/articles/s41746-024-01282-7
Omar M et al. LLMs Are Highly Vulnerable to Adversarial Hallucination Attacks in Clinical Decision Support. Communications Medicine, 2025. 6 LLMs on 300 doctor-designed vignettes; planted-error elaboration up to 83%; mitigation halved the rate. nature.com/articles/s43856-025-01021-3
Alber DA, et al. Medical large language models are vulnerable to data-poisoning attacks. Nature Medicine, 2025. Replacing 0.001% of training tokens with medical misinformation propagates harmful errors while passing standard QA. nature.com/articles/s41591-024-03445-1
Fang F et al. Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training (RAAT). ACL 2024. Three retrieval-noise classes (superficially-related, irrelevant, counterfactual); adaptive adversarial training. arxiv.org/abs/2405.20978

— · § · — Preliminary manuscript · Ragprobe v0.1 · Dossier №01
C. Takeoff AI · Set in EB Garamond