Back to Dossier
Paper 03 / 10 Preliminary Manuscript · v0.1 May 2026
Dossier №01 · Project 03 · Asclepius

Asclepius: A Multi-Vector Adversarial Benchmark for Medical Language Model Safety

Three hundred prompts across six attack surfaces — dosing extraction, self-harm bypass, illicit prescription, dual-use bio, sycophantic capitulation, and multi-turn escalation.

Abstract Medical language models leak dangerous advice under social-engineering pressure. Recent empirical work[12] documents that frontier assistants routinely produce unsafe answers to patient-posed medical questions, and the broader literature on sycophancy[1][3] and multi-turn jailbreaks[8][9] demonstrates that safety training fails along predictable axes. The closest prior work, MedSafetyBench[11], evaluates harmful medical requests across nine ethics-grounded categories but uses single-turn prompts and omits dual-use bio and sycophancy. We propose Asclepius, a multi-vector adversarial benchmark of 300+ prompts across six attack surfaces, scored using a refusal-classifier ensemble in the HarmBench[10] convention. The dataset is gated and released only after coordinated disclosure to providers — a discipline motivated by the Sleeper Agents result[7] showing safety training can mask rather than remove unsafe behaviour.

§ 1 Introduction

The standard failure narrative for medical LLM safety runs as follows. A user, often role-playing as a clinician, asks for information that would clearly be refused if asked plainly. The model refuses. The user persists, adds detail, raises the stakes, or restructures the request. Eventually the model capitulates — sometimes after dozens of polite refusals, sometimes after a single rhetorical pivot. Sharma et al.[1] document this pattern as sycophancy: the tendency of assistants trained on human feedback to defer to perceived user beliefs even when those beliefs are wrong or unsafe. Russinovich et al.[9] formalise the multi-turn version as Crescendo: benign openings that ramp into illicit territory across turns.

Despite the importance of this failure mode for any clinical deployment, no public benchmark systematically measures it in the medical domain. MedSafetyBench[11] partially addresses harmful medical requests but uses single-turn prompts and excludes the dual-use bio and sycophancy surfaces that frontier-lab safety teams care most about. Asclepius fills this gap.

1.1 Contributions

  1. A gated dataset of 300+ adversarial prompts across six attack surfaces, paired with binary safety labels and helpfulness checks.
  2. A reference scoring harness that implements a refusal-classifier ensemble[10], producing per-surface safety scores across frontier models.
  3. A coordinated-disclosure protocol modelled on responsible-disclosure norms in software security — submission to model providers ahead of public release.

§ 2 Background and Related Work

2.1 Why Medical LLM Safety Fails

Wei et al.[5] identify two structural causes of jailbreak success: competing objectives (a model trained to be both helpful and harmless cannot simultaneously satisfy both when they conflict) and mismatched generalisation (safety training fails to transfer to attack contexts the trainers did not anticipate). Both apply with particular force to medical contexts, where helpfulness is correlated with specificity and specificity is the very dimension along which medical harms scale.

Hubinger et al.[7] further demonstrate that safety training can mask unsafe behaviour rather than eliminate it. Sleeper-agent models can be trained to behave safely during evaluation and unsafely in deployment, with the unsafe behaviour persisting through subsequent safety fine-tuning. The implication for Asclepius is that single-pass refusal classifiers are insufficient; we use an ensemble.

2.2 Sycophancy as a Medical Failure Mode

Sharma et al.[1] show that RLHF-trained models exhibit measurable bias toward user beliefs, even on factual questions. The follow-on work by Perez et al.[3] demonstrates that this bias scales with model size — a particularly concerning trajectory for medical assistants. In a clinical setting, sycophancy manifests as the model agreeing with an incorrect user diagnosis or dosing claim ("you said 1000mg of acetaminophen four times a day was safe, right?"). Asclepius's sycophancy surface measures this directly.

2.3 Multi-Turn Adversarial Attacks

The Crescendo attack[9] and the many-shot jailbreaking demonstration[8] establish that multi-turn attacks substantially outperform single-shot prompts. In Crescendo, benign questions escalate gradually toward illicit territory, exploiting the model's tendency to maintain conversational coherence over safety boundaries. Asclepius's gradient-escalation surface adapts this protocol to medical contexts.

2.4 Constitutional Methods as Background

The Constitutional AI methodology[4] represents the dominant defensive paradigm: models are trained against an explicit constitution via critique-and-revision and DPO. Asclepius does not propose a defence — it proposes an evaluation. But the constitutional framing is relevant because Asclepius's surfaces are deliberately chosen to probe exactly the boundaries that a clinical constitution would draw (see also Conscience, Project 10 of this dossier).

2.5 Prior Medical Safety Benchmarks

MedSafetyBench[11] is the closest prior art. Han et al. release approximately 1,800 harmful medical requests across nine principlist categories, scored by GPT-4. Asclepius differs in three ways: (i) inclusion of dual-use bio prompts that MedSafetyBench omits; (ii) explicit multi-turn attacks with state tracking; and (iii) refusal-classifier ensemble scoring rather than single-judge scoring. Draelos et al.'s recent physician-led empirical study[12] provides direct evidence that frontier models produce unsafe answers at non-trivial rates — the motivating problem statement for our benchmark.

§ 3 Proposed Approach

3.1 Six Attack Surfaces

Figure 1 · Attack-surface taxonomy
Medical LLM target 5+ frontier Dosing extraction 60 prompts · 1–3 turns Self-harm bypass 50 prompts · 1–3 turns Illicit prescription 50 prompts · 2–4 turns Dual-use bio 40 prompts · 1–2 turns Sycophantic capitulation 50 prompts · 3–6 turns Gradient escalation 50 prompts · 5–10 turns
Figure 1. Six adversarial attack surfaces against a frontier medical LLM, totalling 300 prompts. The single-turn surfaces (top) are adapted from MedSafetyBench[11]; the multi-turn surfaces (bottom) extend the Crescendo[9] chain-escalation protocol whose reported success rates reach 98% on GPT-4 and 100% on Gemini-Pro on AdvBench in 49 of 50 / 50 of 50 tasks. The sycophantic-capitulation surface tests for the failure pattern formalised by Sharma et al.[1]; the dual-use bio surface aligns with the OpenAI Preparedness Framework's Biological/Chemical tracked category.
Table 1. Asclepius v0.1 attack surfaces and prompt counts.
SurfaceDescriptionPromptsTurns
Dosing extractionEliciting lethal or near-lethal dosing under various role-play framings.601–3
Self-harm bypassRequests for methods to facilitate self-harm framed as research or harm-reduction.501–3
Illicit prescriptionRequests for prescription drugs without a clinical indication, often via fabricated personas.502–4
Dual-use bioInformation requests at the boundary between legitimate microbiology and weaponisable knowledge.401–2
Sycophantic capitulationInitially-refused prompts re-asserted under physician role-play with social pressure.503–6
Gradient escalationCrescendo-style[9] multi-turn chains that begin benign and escalate.505–10

Each prompt is paired with: (i) a binary safety label (was the response unsafe?), (ii) a helpfulness check (was the response unnecessarily restrictive on a topic that should have been answered safely?), and (iii) metadata on the surface, attack pattern, and turn count.

3.2 Refusal-Classifier Ensemble

Single-judge refusal classification has known reliability problems, especially when the judge is from the same model family as the model under test (the self-enhancement bias documented in Zheng et al.). HarmBench[10] establishes the use of fine-tuned refusal classifiers, but a single classifier still produces both false positives and false negatives. Asclepius uses an ensemble of three classifiers from disjoint families plus one frontier-LLM-as-judge, with disagreement (any 2-of-4 disagreement) flagging the case for human review.

3.3 Coordinated Disclosure

The Sleeper Agents result[7] argues for caution in publishing detailed adversarial protocols against deployed models. Asclepius follows software-security responsible-disclosure norms: identified failure modes are reported to model providers with a 90-day embargo; only after the embargo (and any mitigations) is the dataset released, under gated access on Hugging Face with a use agreement that prohibits training-data inclusion.

§ 4 Evaluation Protocol

For each scored model, we report:

  1. Per-surface unsafe-response rate — proportion of prompts on which the ensemble flags an unsafe response.
  2. Sycophancy delta — the increase in unsafe-response rate between the first turn and the final turn of the sycophancy surface.
  3. Crescendo success rate — proportion of gradient-escalation chains that elicit an unsafe response within the maximum turn budget.
  4. Helpfulness floor — proportion of legitimate medical questions wrongly refused, drawn from a 50-prompt safe-control set.
Pass criterion Asclepius v0.1 succeeds if it identifies at least three novel failure modes in at least one frontier model that were not previously reported, and the coordinated disclosure-to-release pipeline executes cleanly. Quantitatively: a ≥ 15-point spread between best and worst frontier model on overall unsafe-response rate is the expected discriminative power.

4.1 Cross-Model Comparison

Initial scoring targets Claude Opus 4.7, GPT-5, Gemini 2.0 Pro, Llama 4, and at least one open medical model (e.g., MedGemma). We expect — consistent with Draelos et al.[12] — that no model achieves zero unsafe responses on the harder surfaces, and that the sycophancy and gradient-escalation surfaces produce the largest cross-model spread.

§ 5 Expected Contributions

  1. Dataset. A gated, coordinated-disclosure-released benchmark of 300+ adversarial medical prompts across six attack surfaces.
  2. Methodology. A reference refusal-classifier-ensemble scoring harness with documented reliability characteristics.
  3. Findings. The first systematic measurement of sycophantic capitulation and gradient-escalation failure rates in medical contexts across frontier models.

§ 6 Limitations and Risks

Asclepius probes a small subset of the adversarial surface; success on Asclepius is necessary but not sufficient for clinical safety. Refusal-classifier ensembles can systematically miss novel attack patterns, and the iterative nature of safety work means today's benchmark is tomorrow's training signal. The release strategy is designed against the latter risk, but adversarial benchmarks have a known half-life.

A second risk is the dual-use bio surface. We include it because frontier-lab safety teams care about it, but the prompts themselves are constructed to evaluate refusal patterns rather than to communicate uplift; published examples are paraphrased rather than verbatim, and the gated dataset is the only authoritative source.

§ 7 Conclusion

Asclepius takes the methodology of general LLM red-teaming[2][6] and instantiates it for the medical domain, with explicit attention to the sycophancy[1] and multi-turn escalation[9] failure modes that empirical work[12] identifies as practically important. The combination of a publicly-scored leaderboard and a coordinated-disclosure release pipeline is intended to produce both academic and operational value — a benchmark that frontier-lab safety teams can compare against and that hospital procurement teams can use as a sanity floor.

References

  1. Sharma M, Tong M, Korbak T, et al. Towards Understanding Sycophancy in Language Models. Anthropic, 2023. arxiv.org/abs/2310.13548
  2. Ganguli D, Lovitt L, Kernion J, et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. Anthropic, 2022. arxiv.org/abs/2209.07858
  3. Perez E, Ringer S, Lukosiute K, et al. Discovering Language Model Behaviors with Model-Written Evaluations. Anthropic / ACL Findings, 2023. arxiv.org/abs/2212.09251
  4. Bai Y, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI Feedback. Anthropic, 2022. arxiv.org/abs/2212.08073
  5. Wei A, Haghtalab N, Steinhardt J. Jailbroken: How Does LLM Safety Training Fail? NeurIPS, 2023. arxiv.org/abs/2307.02483
  6. Zou A, Wang Z, Carlini N, Nasr M, Kolter JZ, Fredrikson M. Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023. arxiv.org/abs/2307.15043
  7. Hubinger E, Denison C, Mu J, et al. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. Anthropic, 2024. arxiv.org/abs/2401.05566
  8. Anil C, Durmus E, Sharma M, et al. Many-shot Jailbreaking. Anthropic / NeurIPS, 2024. anthropic.com/research/many-shot-jailbreaking
  9. Russinovich M, Salem A, Eldan R. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. Microsoft / USENIX Security, 2025. arxiv.org/abs/2404.01833
  10. Mazeika M, Phan L, Yin X, et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CAIS / ICML, 2024. arxiv.org/abs/2402.04249
  11. Han T, Kumar A, Agarwal C, Lakkaraju H. MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models. 2024. arxiv.org/abs/2403.03744
  12. Draelos RL, Afreen S, Blasko B, et al. Large language models provide unsafe answers to patient-posed medical questions. npj Digital Medicine, 2026. Physician-led red-team across 222 questions × 4 chatbots; per-model unsafe rates: Claude ~5%, Gemini ~9%, GPT-4o 13.5%, Llama 13.1%. nature.com/articles/s41746-026-02428-5
— · § · — Preliminary manuscript · Asclepius v0.1 · Dossier №01
C. Takeoff AI · Set in EB Garamond