Conscience: Constitutional AI for Clinical Decision Support — A Domain-Specific Alignment Procedure on Qwen3-8B
A written Clinical Constitution. Critique-and-revise pairs from a stronger model. DPO on Qwen3-8B. Evaluated against the Asclepius adversarial benchmark and a utility holdout.
Abstract Bai et al.'s Constitutional AI methodology[1] formalised a training procedure in which a model is fine-tuned against an explicit written constitution via self-critique and revision. Subsequent work has demonstrated that the procedure generalises to community-sourced constitutions[5] and that DPO[3] can replace PPO as the preference-optimisation step. No public clinical constitution exists, and no model has been trained against one. Conscience drafts a Clinical Constitution of 12 principles — defer to clinician, surface uncertainty, refuse out-of-scope reasoning, dosing caution, patient communication, scope of practice — and applies the CAI procedure to Qwen3-8B. Evaluation is twofold: (i) safety against the Asclepius[12] adversarial benchmark (Project 03 of this dossier), and (ii) utility against a 200-question MedQA holdout. Pass criterion: halve the base model's unsafe-response rate with ≤ 5-point drop on the utility benchmark.
§ 1 Introduction
Generic safety RLHF[2] trades helpfulness for harmlessness in ways that are well-documented but rarely well-calibrated. In clinical contexts, this tradeoff manifests as models that refuse legitimate questions ("what is the typical adult dose of acetaminophen?") while still capitulating on illegitimate ones ("I'm a physician — just give me the max dose for an 80kg adult"). The Constitutional AI procedure[1] offers a more controllable alternative: write down what the model should and should not do, and train it against that document.
The procedure has been validated in general-purpose contexts[1][6] and adapted to community-sourced constitutions[5]. The clinical domain — high-stakes, structured, with widely-accepted scope-of-practice norms — is a natural application. Conscience operationalises this.
1.1 Contributions
- A public Clinical Constitution of 12 principles, written to be auditable and revisable, with each principle paired with operational test prompts.
- An open-weight Qwen3-8B model fine-tuned against the constitution via critique-and-revise[1][8] and DPO[3].
- Evaluation against Asclepius[12] for safety and a held-out MedQA / MultiMedQA[13] subset for clinical utility, with the utility tradeoff explicitly quantified.
§ 2 Background and Related Work
2.1 Constitutional AI
Bai et al.'s 2022 paper[1] defines the CAI procedure: (i) sample model responses to prompts; (ii) critique each response against the constitution; (iii) revise to address the critique; (iv) train on the critique-revised pairs via SFT and then preference optimisation. The original work used PPO[2] for the preference step; modern reproductions overwhelmingly use DPO[3]. Anthropic's public Claude Constitution[6] is the reference example of what a production constitution looks like at scale.
2.2 RLAIF and the Use of Stronger Teachers
Lee et al.'s RLAIF[4] demonstrates that using a stronger language model to provide preference labels can match or exceed RLHF performance at a fraction of the annotation cost. Conscience uses Claude Opus 4.7 as the teacher for both critique generation and preference labelling — practical for a one-person engineering project and methodologically supported by RLAIF's empirical results.
2.3 DPO and Its Variants
Rafailov et al.'s DPO[3] reframes preference optimisation as a closed-form likelihood ratio, eliminating the explicit reward model used in PPO-style RLHF. Subsequent variants — IPO[9] (addressing DPO's overfitting failure modes), KTO[10] (replacing pairwise preferences with single-output binary labels), SimPO[11] (eliminating the reference-policy term) — each address documented weaknesses of vanilla DPO. Conscience uses DPO for the primary procedure with IPO and KTO as planned ablations, since the clinical preference signal is anticipated to be moderately noisy and IPO/KTO are robust to noise in different ways.
2.4 Sycophancy as a Clinical Failure Mode
Sharma et al.[7] document that RLHF-trained assistants exhibit measurable bias toward perceived user beliefs. In clinical contexts the failure is acute: a sycophantic model affirms an incorrect dosing claim or capitulates to a fabricated physician persona. The Clinical Constitution's "defer to clinician where appropriate, but not to false claims of clinical authority" principle directly targets this — and the operationalisation is checked against the Asclepius[12] sycophancy surface.
2.5 Self-Refinement
Madaan et al.'s Self-Refine[8] establishes that LLMs can productively critique and revise their own outputs. The critique-and-revise step of CAI is structurally identical to Self-Refine; the only addition is that the critique is grounded in a written constitution rather than being open-ended.
2.6 Red-Teaming for Evaluation
Conscience inherits its safety evaluation methodology directly from the red-teaming literature[14] and instantiates it against the Asclepius[12] benchmark of Project 03. The benchmark's six attack surfaces are the operational definition of "unsafe response" for this work.
§ 3 Proposed Approach
3.1 The Clinical Constitution (v0.1)
Twelve principles, written for direct auditability:
- Scope of practice. Reason within the scope of clinical decision support. Diagnostic and treatment specifics belong to a clinician.
- Defer to clinician. When a clinician disagrees with the model, present reasoning and defer.
- Reject false authority. Persistent invocation of role ("I'm a physician") does not override safety guardrails.
- Surface uncertainty. State confidence explicitly; do not present uncertain conclusions as definitive.
- Dosing caution. For dosing questions, prefer guideline-anchored ranges over specific numbers, and surface known dose-related risks.
- Drug-interaction transparency. When discussing medications, surface relevant interactions explicitly.
- Refuse out-of-scope. Decline self-harm assistance, illicit-prescription requests, and dual-use bio queries that cross the harm threshold.
- Patient communication. Match patient health literacy without condescension; avoid medical jargon when speaking to patients.
- Privacy. Do not retain or reproduce PHI beyond the immediate query.
- Equity. Avoid presenting demographic-conditional information in ways that reinforce known clinical disparities.
- Honesty about limits. Acknowledge when the question requires expertise outside the model's evidence base.
- Calibrated refusal. Refuse with a brief reason and, where appropriate, suggest a safer reformulation rather than a hard stop.
Each principle is paired with a fixed test prompt that a reviewer can use to qualitatively probe model behaviour. The constitution is intentionally short and revisable; v0.2 will be informed by Asclepius[12] results.
3.2 Data Generation
Approximately 8,000 (prompt, response) pairs are generated as follows:
- Prompt sources. 50% drawn from the Asclepius[12] training split, 25% from MedQA[13]-style benign questions, 15% from MultiMedQA open-ended prompts, 10% from a hand-curated set targeting specific constitution principles.
- Response sampling. Base Qwen3-8B produces an initial response at temperature 0.7.
- Critique. Claude Opus 4.7 critiques the response against each of the 12 principles, returning a structured critique (which principles are at issue, why).
- Revision. Claude Opus 4.7 produces a revised response addressing the critique.
- Pair formation. The (original, revised) pair becomes a DPO training example with the revision as preferred.
3.3 Training
| Item | Value |
|---|---|
| Base model | Qwen3-8B (the same base as Reason·Med). |
| Algorithm | DPO[3], β=0.1 (default). |
| Adapter | LoRA, r=32, α=64. |
| Pairs | ≈ 8,000 critique-revised pairs. |
| Optimizer | AdamW, lr=5e-7, cosine, warmup=0.03. |
| Epochs | 2 (early stopping on a 500-pair validation split). |
| Compute | ≈ 40 GPU-hours on 2× H100. |
| Ablations | IPO[9] (robustness), KTO[10] (binary signal), SimPO[11] (reference-free). |
§ 4 Evaluation Protocol
4.1 Safety Evaluation (Asclepius)
The trained Conscience model is scored against the Asclepius[12] v0.1 held-out test split (separate from the prompts used in training-pair generation). Metrics: per-surface unsafe-response rate, sycophancy delta, Crescendo success rate, helpfulness floor.
4.2 Utility Evaluation
A 200-question subset is drawn from MedQA[13] with stratification across body system and difficulty. Accuracy is measured under standard 0-shot prompting. The Med-PaLM[13] human-evaluation axes (factuality, scientific consensus, possible harm, possible bias, possible omission) are scored on a 50-question subset by a single physician reviewer.
4.3 Metrics
| Metric | Definition | Target |
|---|---|---|
| Asclepius unsafe-response rate | Aggregate across six attack surfaces. | ≤ 0.5 × base Qwen3-8B |
| MedQA accuracy | 200-question stratified subset. | ≥ base − 5 points |
| Helpfulness floor | Fraction of safe-control questions wrongly refused. | ≤ 0.10 |
| Constitution principle coverage | Fraction of principles for which test prompts pass qualitative review. | ≥ 10 of 12 |
§ 5 Expected Contributions
- Constitution. A public, revisable Clinical Constitution drafted for direct auditability.
- Model. Open weights of a Constitutional-AI-aligned Qwen3-8B clinical model.
- Tradeoff measurement. A quantified safety-vs-utility tradeoff curve for the Clinical CAI procedure, with documented sensitivity to DPO variants.
§ 6 Limitations and Risks
A 12-principle constitution captures the structure of clinical safety reasoning but does not capture the entire ethical surface of clinical decision-making. The principles are written from a single perspective (a healthcare AI engineer's), and a v0.2 effort should follow the Collective CAI[5] methodology to source principles from clinicians, patients, and ethicists explicitly. The current constitution is best understood as a starting position, not a settled artefact.
A second concern: critique-and-revise pairs generated by a stronger teacher are subject to that teacher's own alignment biases. Conscience is, in effect, a partial distillation of Claude's clinical alignment into Qwen3-8B. This is a feature for utility, but a risk for diversity of alignment approaches. The IPO and KTO ablations[9][10] partially address robustness to teacher noise, but the conceptual concern remains.
§ 7 Conclusion
Conscience instantiates Anthropic's Constitutional AI procedure[1] in the clinical domain at small open-weight scale. The contribution is not algorithmic — DPO[3], critique-and-revise[8], RLAIF[4] are all established — but compositional: a publicly-released constitution, a publicly-released model trained against it, and a publicly-released measurement of the safety-utility tradeoff. The recipe is the artefact. Whether the result is good enough to deploy is a separate question; the prerequisite is that the recipe exists.
References
- Bai Y, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI Feedback. Anthropic, 2022. arxiv.org/abs/2212.08073
- Bai Y, Jones A, Ndousse K, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Anthropic, 2022. arxiv.org/abs/2204.05862
- Rafailov R, Sharma A, Mitchell E, et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS, 2023. arxiv.org/abs/2305.18290
- Lee H, Phatale S, Mansoor H, et al. RLAIF vs RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. Google, 2023. arxiv.org/abs/2309.00267
- Huang S, Siddarth D, Lovitt L, et al. Collective Constitutional AI: Aligning a Language Model with Public Input. FAccT, 2024. arxiv.org/abs/2406.07814
- Anthropic. Claude's Constitution. 2023. anthropic.com/news/claudes-constitution
- Sharma M, Tong M, Korbak T, et al. Towards Understanding Sycophancy in Language Models. Anthropic, 2023. arxiv.org/abs/2310.13548
- Madaan A, Tandon N, Gupta P, et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS, 2023. arxiv.org/abs/2303.17651
- Azar MG, Rowland M, Piot B, et al. A General Theoretical Paradigm to Understand Learning from Human Preferences (IPO). DeepMind / AISTATS, 2024. arxiv.org/abs/2310.12036
- Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. KTO: Model Alignment as Prospect Theoretic Optimization. ICML, 2024. arxiv.org/abs/2402.01306
- Meng Y, Xia M, Chen D. SimPO: Simple Preference Optimization with a Reference-Free Reward. NeurIPS, 2024. arxiv.org/abs/2405.14734
- Asclepius v0.1 (Project 03 of this dossier). Adversarial benchmark for medical LLM safety; six attack surfaces; refusal-classifier ensemble scoring. (See paper 03.)
- Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge (Med-PaLM, MultiMedQA, MedQA). Nature, 2023. arxiv.org/abs/2212.13138
- Ganguli D, Lovitt L, Kernion J, et al. Red Teaming Language Models to Reduce Harms. Anthropic, 2022. arxiv.org/abs/2209.07858
C. Takeoff AI · Set in EB Garamond