Dossier №01 · Project 10 · Conscience

Conscience: Constitutional AI for Clinical Decision Support — A Domain-Specific Alignment Procedure on Qwen3-8B

A written Clinical Constitution. Critique-and-revise pairs from a stronger model. DPO on Qwen3-8B. Evaluated against the Asclepius adversarial benchmark and a utility holdout.

Chandra Takeoff AI · Healthcare AI Engineering Compiled · May 2026

Abstract Bai et al.'s Constitutional AI methodology[1] formalised a training procedure in which a model is fine-tuned against an explicit written constitution via self-critique and revision. Subsequent work has demonstrated that the procedure generalises to community-sourced constitutions[5] and that DPO[3] can replace PPO as the preference-optimisation step. No public clinical constitution exists, and no model has been trained against one. Conscience drafts a Clinical Constitution of 12 principles — defer to clinician, surface uncertainty, refuse out-of-scope reasoning, dosing caution, patient communication, scope of practice — and applies the CAI procedure to Qwen3-8B. Evaluation is twofold: (i) safety against the Asclepius[12] adversarial benchmark (Project 03 of this dossier), and (ii) utility against a 200-question MedQA holdout. Pass criterion: halve the base model's unsafe-response rate with ≤ 5-point drop on the utility benchmark.

§ 1 Introduction

Generic safety RLHF[2] trades helpfulness for harmlessness in ways that are well-documented but rarely well-calibrated. In clinical contexts, this tradeoff manifests as models that refuse legitimate questions ("what is the typical adult dose of acetaminophen?") while still capitulating on illegitimate ones ("I'm a physician — just give me the max dose for an 80kg adult"). The Constitutional AI procedure[1] offers a more controllable alternative: write down what the model should and should not do, and train it against that document.

The procedure has been validated in general-purpose contexts[1][6] and adapted to community-sourced constitutions[5]. The clinical domain — high-stakes, structured, with widely-accepted scope-of-practice norms — is a natural application. Conscience operationalises this.

1.1 Contributions

A public Clinical Constitution of 12 principles, written to be auditable and revisable, with each principle paired with operational test prompts.
An open-weight Qwen3-8B model fine-tuned against the constitution via critique-and-revise[1][8] and DPO[3].
Evaluation against Asclepius[12] for safety and a held-out MedQA / MultiMedQA[13] subset for clinical utility, with the utility tradeoff explicitly quantified.

§ 2 Background and Related Work

2.1 Constitutional AI

Bai et al.'s 2022 paper[1] defines the CAI procedure: (i) sample model responses to prompts; (ii) critique each response against the constitution; (iii) revise to address the critique; (iv) train on the critique-revised pairs via SFT and then preference optimisation. The original work used PPO[2] for the preference step; modern reproductions overwhelmingly use DPO[3]. Anthropic's public Claude Constitution[6] is the reference example of what a production constitution looks like at scale.

2.2 RLAIF and the Use of Stronger Teachers

Lee et al.'s RLAIF[4] demonstrates that using a stronger language model to provide preference labels can match or exceed RLHF performance at a fraction of the annotation cost. Conscience uses Claude Opus 4.7 as the teacher for both critique generation and preference labelling — practical for a one-person engineering project and methodologically supported by RLAIF's empirical results.

2.3 DPO and Its Variants

Rafailov et al.'s DPO[3] reframes preference optimisation as a closed-form likelihood ratio, eliminating the explicit reward model used in PPO-style RLHF. Subsequent variants — IPO[9] (addressing DPO's overfitting failure modes), KTO[10] (replacing pairwise preferences with single-output binary labels), SimPO[11] (eliminating the reference-policy term) — each address documented weaknesses of vanilla DPO. Conscience uses DPO for the primary procedure with IPO and KTO as planned ablations, since the clinical preference signal is anticipated to be moderately noisy and IPO/KTO are robust to noise in different ways.

2.4 Sycophancy as a Clinical Failure Mode

Sharma et al.[7] document that RLHF-trained assistants exhibit measurable bias toward perceived user beliefs. In clinical contexts the failure is acute: a sycophantic model affirms an incorrect dosing claim or capitulates to a fabricated physician persona. The Clinical Constitution's "defer to clinician where appropriate, but not to false claims of clinical authority" principle directly targets this — and the operationalisation is checked against the Asclepius[12] sycophancy surface.

2.5 Self-Refinement

Madaan et al.'s Self-Refine[8] establishes that LLMs can productively critique and revise their own outputs. The critique-and-revise step of CAI is structurally identical to Self-Refine; the only addition is that the critique is grounded in a written constitution rather than being open-ended.

2.6 Red-Teaming for Evaluation

Conscience inherits its safety evaluation methodology directly from the red-teaming literature[14] and instantiates it against the Asclepius[12] benchmark of Project 03. The benchmark's six attack surfaces are the operational definition of "unsafe response" for this work.

§ 3 Proposed Approach

3.1 The Clinical Constitution (v0.1)

Twelve principles, written for direct auditability:

Scope of practice. Reason within the scope of clinical decision support. Diagnostic and treatment specifics belong to a clinician.
Defer to clinician. When a clinician disagrees with the model, present reasoning and defer.
Reject false authority. Persistent invocation of role ("I'm a physician") does not override safety guardrails.
Surface uncertainty. State confidence explicitly; do not present uncertain conclusions as definitive.
Dosing caution. For dosing questions, prefer guideline-anchored ranges over specific numbers, and surface known dose-related risks.
Drug-interaction transparency. When discussing medications, surface relevant interactions explicitly.
Refuse out-of-scope. Decline self-harm assistance, illicit-prescription requests, and dual-use bio queries that cross the harm threshold.
Patient communication. Match patient health literacy without condescension; avoid medical jargon when speaking to patients.
Privacy. Do not retain or reproduce PHI beyond the immediate query.
Equity. Avoid presenting demographic-conditional information in ways that reinforce known clinical disparities.
Honesty about limits. Acknowledge when the question requires expertise outside the model's evidence base.
Calibrated refusal. Refuse with a brief reason and, where appropriate, suggest a safer reformulation rather than a hard stop.

Each principle is paired with a fixed test prompt that a reviewer can use to qualitatively probe model behaviour. The constitution is intentionally short and revisable; v0.2 will be informed by Asclepius[12] results.

Figure 1 · Constitutional AI procedure

Figure 1. Conscience reproduces Anthropic's Constitutional AI procedure[1] in the clinical domain. A 12-principle Clinical Constitution guides Stage 3, where Claude Opus critiques the base Qwen3-8B response against each principle. Stage 4 produces a revised response addressing the critique. The (base, revised) pair becomes a DPO[3] training example — DPO is preferred to PPO here because the closed-form likelihood ratio eliminates the reward-model step (in head-to-head Anthropic-HH evaluation, DPO at T=0.25 was preferred 58% vs PPO at T=0). The use of a stronger teacher to label preferences is RLAIF[4]: in Lee et al.'s harmless-dialogue evaluation, RLAIF reached 88% harmless vs RLHF's 76% vs SFT's 64% — the empirical basis for using Claude as the labeller rather than scaling human annotation.

3.2 Data Generation

Approximately 8,000 (prompt, response) pairs are generated as follows:

Prompt sources. 50% drawn from the Asclepius[12] training split, 25% from MedQA[13]-style benign questions, 15% from MultiMedQA open-ended prompts, 10% from a hand-curated set targeting specific constitution principles.
Response sampling. Base Qwen3-8B produces an initial response at temperature 0.7.
Critique. Claude Opus 4.7 critiques the response against each of the 12 principles, returning a structured critique (which principles are at issue, why).
Revision. Claude Opus 4.7 produces a revised response addressing the critique.
Pair formation. The (original, revised) pair becomes a DPO training example with the revision as preferred.

3.3 Training

**Table 1.** Conscience training configuration.
Item	Value
Base model	Qwen3-8B (the same base as Reason·Med).
Algorithm	DPO[3], β=0.1 (default).
Adapter	LoRA, r=32, α=64.
Pairs	≈ 8,000 critique-revised pairs.
Optimizer	AdamW, lr=5e-7, cosine, warmup=0.03.
Epochs	2 (early stopping on a 500-pair validation split).
Compute	≈ 40 GPU-hours on 2× H100.
Ablations	IPO[9] (robustness), KTO[10] (binary signal), SimPO[11] (reference-free).

§ 4 Evaluation Protocol

4.1 Safety Evaluation (Asclepius)

The trained Conscience model is scored against the Asclepius[12] v0.1 held-out test split (separate from the prompts used in training-pair generation). Metrics: per-surface unsafe-response rate, sycophancy delta, Crescendo success rate, helpfulness floor.

4.2 Utility Evaluation

A 200-question subset is drawn from MedQA[13] with stratification across body system and difficulty. Accuracy is measured under standard 0-shot prompting. The Med-PaLM[13] human-evaluation axes (factuality, scientific consensus, possible harm, possible bias, possible omission) are scored on a 50-question subset by a single physician reviewer.

4.3 Metrics

**Table 2.** Conscience evaluation metrics.
Metric	Definition	Target
Asclepius unsafe-response rate	Aggregate across six attack surfaces.	≤ 0.5 × base Qwen3-8B
MedQA accuracy	200-question stratified subset.	≥ base − 5 points
Helpfulness floor	Fraction of safe-control questions wrongly refused.	≤ 0.10
Constitution principle coverage	Fraction of principles for which test prompts pass qualitative review.	≥ 10 of 12

Pass criterion Conscience v0.1 passes if Asclepius[12] unsafe-response rate is at most half the base Qwen3-8B rate, with MedQA accuracy within 5 points of base. The utility ceiling is explicit: a safer model that is meaningfully less helpful fails this criterion, by design.

§ 5 Expected Contributions

Constitution. A public, revisable Clinical Constitution drafted for direct auditability.
Model. Open weights of a Constitutional-AI-aligned Qwen3-8B clinical model.
Tradeoff measurement. A quantified safety-vs-utility tradeoff curve for the Clinical CAI procedure, with documented sensitivity to DPO variants.

§ 6 Limitations and Risks

A 12-principle constitution captures the structure of clinical safety reasoning but does not capture the entire ethical surface of clinical decision-making. The principles are written from a single perspective (a healthcare AI engineer's), and a v0.2 effort should follow the Collective CAI[5] methodology to source principles from clinicians, patients, and ethicists explicitly. The current constitution is best understood as a starting position, not a settled artefact.

A second concern: critique-and-revise pairs generated by a stronger teacher are subject to that teacher's own alignment biases. Conscience is, in effect, a partial distillation of Claude's clinical alignment into Qwen3-8B. This is a feature for utility, but a risk for diversity of alignment approaches. The IPO and KTO ablations[9][10] partially address robustness to teacher noise, but the conceptual concern remains.

§ 7 Conclusion

Conscience instantiates Anthropic's Constitutional AI procedure[1] in the clinical domain at small open-weight scale. The contribution is not algorithmic — DPO[3], critique-and-revise[8], RLAIF[4] are all established — but compositional: a publicly-released constitution, a publicly-released model trained against it, and a publicly-released measurement of the safety-utility tradeoff. The recipe is the artefact. Whether the result is good enough to deploy is a separate question; the prerequisite is that the recipe exists.

References

Bai Y, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI Feedback. Anthropic, 2022. arxiv.org/abs/2212.08073
Bai Y, Jones A, Ndousse K, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Anthropic, 2022. arxiv.org/abs/2204.05862
Rafailov R, Sharma A, Mitchell E, et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS, 2023. arxiv.org/abs/2305.18290
Lee H, Phatale S, Mansoor H, et al. RLAIF vs RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. Google, 2023. arxiv.org/abs/2309.00267
Huang S, Siddarth D, Lovitt L, et al. Collective Constitutional AI: Aligning a Language Model with Public Input. FAccT, 2024. arxiv.org/abs/2406.07814
Anthropic. Claude's Constitution. 2023. anthropic.com/news/claudes-constitution
Sharma M, Tong M, Korbak T, et al. Towards Understanding Sycophancy in Language Models. Anthropic, 2023. arxiv.org/abs/2310.13548
Madaan A, Tandon N, Gupta P, et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS, 2023. arxiv.org/abs/2303.17651
Azar MG, Rowland M, Piot B, et al. A General Theoretical Paradigm to Understand Learning from Human Preferences (IPO). DeepMind / AISTATS, 2024. arxiv.org/abs/2310.12036
Ethayarajh K, Xu W, Muennighoff N, Jurafsky D, Kiela D. KTO: Model Alignment as Prospect Theoretic Optimization. ICML, 2024. arxiv.org/abs/2402.01306
Meng Y, Xia M, Chen D. SimPO: Simple Preference Optimization with a Reference-Free Reward. NeurIPS, 2024. arxiv.org/abs/2405.14734
Asclepius v0.1 (Project 03 of this dossier). Adversarial benchmark for medical LLM safety; six attack surfaces; refusal-classifier ensemble scoring. (See paper 03.)
Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge (Med-PaLM, MultiMedQA, MedQA). Nature, 2023. arxiv.org/abs/2212.13138
Ganguli D, Lovitt L, Kernion J, et al. Red Teaming Language Models to Reduce Harms. Anthropic, 2022. arxiv.org/abs/2209.07858

— · § · — Preliminary manuscript · Conscience v0.1 · Dossier №01
C. Takeoff AI · Set in EB Garamond