Reason·Med: An Open Clinical Reasoning Model via Continued Pretraining, SFT, and GRPO with Verifiable Rewards
Three stages, one base model (Qwen3-8B), four medical question-answering benchmarks, full training transparency. The first open clinical R-model with a reproducible recipe.
Abstract Two open developments in 2024–2025 reset the landscape for medical reasoning models: the GRPO algorithm introduced in DeepSeekMath[1] and applied at scale in DeepSeek-R1[2], and the release of MedGemma[5] as a strong open medical foundation. MedGemma's training pipeline, however, is not fully reproducible from public materials; DeepSeek-R1 has not been domain-adapted to medicine. Reason·Med closes both gaps by applying a transparent three-stage post-training procedure to Qwen3-8B[3]: (i) continued pretraining[6] on a curated PubMed and clinical-guideline corpus; (ii) supervised fine-tuning on reasoning traces distilled from a stronger teacher; (iii) GRPO[1] with verifiable rewards[7] on MedQA[14], MedMCQA[15], and PubMedQA[16]. The published MedGemma family[5] ships at 4B (text+vision, 64.4% on MedQA) and 27B-text (87.7% on MedQA, reported "within 3 points of DeepSeek-R1 at approximately one-tenth the inference cost"). Reason·Med v0.1 targets exceeding the MedGemma-4B bar by ≥ 3 points on MedQA at the same parameter scale band, with the entire data manifest, training scripts, and intermediate checkpoints released; the v0.2 stretch goal is to approach the MedGemma-27B-text bar at a 3.4× smaller parameter count.
§ 1 Introduction
The dominant 2024–2025 advance in reasoning models has been reinforcement learning with verifiable rewards (RLVR), formalised in Tulu 3[7] and demonstrated at scale in DeepSeek-R1[2]. RLVR is uniquely well-suited to medical multiple-choice benchmarks because the reward — the answer matches the gold key — is unambiguously verifiable. The combination of GRPO[1] and RLVR yields the most compute-efficient route to clinical reasoning capability currently known.
Reason·Med exists because no open clinical R-model has been released with the full training pipeline public. MedGemma[5] is a strong open foundation but its training data is partially aggregated and the recipe is not exactly reproducible. Meditron-70B[13] is the strongest open clinical pretrained model but predates the reasoning-RL wave. Med-PaLM 2[12] is closed-weight. Reason·Med is the open-weight, fully-reproducible counterpart.
1.1 Contributions
- An open-weight clinical reasoning model built on Qwen3-8B[3], released with full LoRA[8] adapter weights, training logs, and a fixed data manifest.
- A reproducible three-stage post-training recipe: continued pretraining[6] → reasoning-trace SFT → GRPO[1] with verifiable rewards[7].
- Head-to-head evaluation against MedGemma[5], Meditron[13], and base Qwen3-8B on MedQA[14], MedMCQA[15], PubMedQA[16], and the MultiMedQA[11] human-evaluation axes.
§ 2 Background and Related Work
2.1 GRPO and Verifiable Rewards
Shao et al.'s DeepSeekMath[1] introduced Group Relative Policy Optimization — a PPO variant that estimates the baseline within a sampled group rather than via a separate value model — and demonstrated it on competition mathematics. DeepSeek-R1[2] applied GRPO at scale with binary verifiable rewards (the answer either matches the gold key or it does not) and showed that the resulting policy exhibits emergent long-form reasoning even from a pretrained-only base. Lambert et al.[7] in Tulu 3 generalised the RLVR pattern and provided the recipe template Reason·Med follows for its Stage 3 procedure.
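The group-relative baseline is small enough to sketch directly. The snippet below (illustrative; the function and variable names are ours, not from [1]) normalises each rollout's reward by the mean and standard deviation of its own group of G samples, which is exactly what lets GRPO drop the value model:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage estimate: each rollout's reward is
    normalised by the mean and std of its own sampled group,
    so no separate value model is needed."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, G = 8 rollouts, binary verifiable rewards:
advs = group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0])
# correct rollouts receive positive advantage, incorrect negative,
# and the advantages sum to zero within the group
```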
2.2 Continued Pretraining for Domain Adaptation
Gururangan et al.[6] (ACL 2020) established that domain-adaptive pretraining yields consistent task gains even when starting from a strong general model — a result confirmed across multiple subsequent studies. For Reason·Med, continued pretraining on PubMed abstracts and ASPEN-style clinical guidelines is intended primarily to refresh medical terminology and clinical conventions in the Qwen3 base, not to substantially shift its reasoning capability.
2.3 Parameter-Efficient Fine-Tuning
LoRA[8] and its 4-bit variant QLoRA[9] are the standard parameter-efficient fine-tuning methods. Reason·Med uses LoRA adapters throughout (continued pretraining, SFT, and GRPO) to keep training reproducible on consumer-grade GPU rentals. Adapter weights are released, along with merged full-precision weights for inference convenience.
2.4 Existing Medical LLMs
Med-PaLM[11] established MultiMedQA and the per-axis human evaluation. Med-PaLM 2[12] reached 86.5% on MedQA via ensemble refinement — the published closed-weight state of the art at scale. Meditron-70B[13] is the strongest open continued-pretrained model. MedGemma[5] released open weights with explicit medical specialisation: MedGemma-4B reaches 64.4% on MedQA and MedGemma-27B-text reaches 87.7%, reported "within 3 points of DeepSeek-R1 at ~1/10 the inference cost." HuatuoGPT-II[17] demonstrated a one-stage medical adaptation recipe beating GPT-4 with a 38% win / 38% tie / 24% loss rate on Chinese medical expert evaluation. The closest architectural prior is HuatuoGPT-o1[18] (Chen et al., Dec 2024): an 8B medical model trained with verifier + RL on only 40,000 verifiable medical problems, reporting +8.5 points on medical benchmarks. Reason·Med's recipe is most directly comparable to HuatuoGPT-o1; the differentiation is full data-and-recipe transparency (HuatuoGPT-o1 releases weights but its training data manifest is partial) and the explicit GRPO-with-Tulu-3-RLVR pipeline rather than the bespoke verifier setup. BioMistral[19] remains the strongest multilingual open medical baseline (Mistral-7B continued-pretrained on PubMed Central). Reason·Med's target is to beat MedGemma-4B on MedQA by at least 3 points; matching HuatuoGPT-o1's +8.5-point delta is the stretch goal.
2.5 Evaluation Datasets
MedQA[14] is the USMLE-style benchmark — 1,273 test items, five-way multiple choice. MedMCQA[15] is the Indian-medical-exam-derived benchmark — 6,150 test items, four-way multiple choice. PubMedQA[16] is the biomedical abstract-grounded yes/no/maybe benchmark — 1,000 expert-labelled items. These three constitute the canonical evaluation triple for medical LLMs and serve as both Stage-2 reasoning-trace sources and Stage-3 RLVR signals.
§ 3 Proposed Approach
3.1 Stage 1 — Continued Pretraining
| Item | Value |
|---|---|
| Base model | Qwen3-8B[3] |
| Corpus | 5B tokens — PubMed abstracts (≈2B), ASPEN / NICE / AHA guidelines (≈1B), MedlinePlus + StatPearls (≈2B). |
| Adapter | LoRA[8], r=64, α=128, dropout=0.05, target_modules=q,k,v,o,gate,up,down. |
| Optimizer | AdamW, lr=5e-5, cosine to 0, warmup=0.03. |
| Batch | Sequence length 512 × 32 sequences × 4-step gradient accumulation (≈65k tokens per optimiser step). |
| Compute | ≈ 200 GPU-hours on 4× H100 (RunPod / Modal). |
3.2 Stage 2 — Reasoning-Trace SFT
Reasoning traces are distilled from a teacher (default: Claude Opus 4.7) prompted to solve MedQA[14], MedMCQA[15], and PubMedQA[16] training-split items with explicit chain-of-thought. Each training example is the tuple (question, teacher_trace, answer). Traces with incorrect final answers are discarded. After temperature-0 teacher sampling over the full training splits, approximately 40,000 traces survive this filtering.
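A minimal sketch of the answer-match filter, assuming a "Final answer: X" convention in the teacher traces (both the tag format and the regex are illustrative assumptions, not a fixed spec):

```python
import re

def keep_trace(teacher_trace, gold_key):
    """Stage 2 filter: keep a distilled trace only if the teacher's
    parsed final answer matches the gold key. Traces with no
    parseable answer are discarded along with wrong ones."""
    m = re.search(r"[Ff]inal answer:\s*\(?([A-Ea-e]|yes|no|maybe)\)?",
                  teacher_trace)
    return m is not None and m.group(1).lower() == gold_key.lower()

keep_trace("... Final answer: (C)", "C")    # True  -> keep
keep_trace("... Final answer: B", "C")      # False -> discard
keep_trace("no parseable conclusion", "C")  # False -> discard
```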
| Item | Value |
|---|---|
| Method | SFT on full traces; loss masked to teacher-trace + final-answer span. |
| Adapter | Stage 1 LoRA continued, no rank change. |
| Optimizer | AdamW, lr=2e-5, cosine, warmup=0.02. |
| Epochs | 3 (early stopping on a held-out 1,000-item subset). |
| Compute | ≈ 60 GPU-hours on 4× H100. |
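The loss-masking row in the table above follows the common Hugging Face convention of setting prompt labels to -100 so that cross-entropy skips them; only the teacher-trace + final-answer span is supervised. A minimal sketch (the helper name is ours):

```python
def build_labels(prompt_ids, trace_ids, ignore_index=-100):
    """Mask the SFT loss to the teacher-trace + final-answer span:
    prompt tokens get ignore_index (skipped by cross-entropy under
    the Hugging Face convention), trace tokens are supervised."""
    input_ids = list(prompt_ids) + list(trace_ids)
    labels = [ignore_index] * len(prompt_ids) + list(trace_ids)
    return input_ids, labels

# toy token ids: 3 prompt tokens, 3 trace/answer tokens
ids, labels = build_labels([101, 7, 8], [9, 10, 2])
# labels: [-100, -100, -100, 9, 10, 2]
```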
3.3 Stage 3 — GRPO with Verifiable Rewards
Stage 3 applies GRPO[1] against the MedQA/MedMCQA training splits using a binary verifiable reward: 1 if the parsed final answer matches the gold key, 0 otherwise. The group size G is 8; the policy is initialised from the Stage 2 LoRA-merged checkpoint, which also serves as the frozen reference policy. KL divergence to the reference is regularised with β=0.01 to prevent reward hacking via length blow-up, a failure mode also documented in the DPO[10] literature; GRPO is chosen over PPO because its group-relative baseline removes the need for a separate value model.
| Item | Value |
|---|---|
| Algorithm | GRPO[1] per the Tulu 3[7] RLVR recipe. |
| Reward | Binary exact-match on gold key; format compliance enforced via regex. |
| Group size | G=8 rollouts per prompt. |
| KL coefficient | β=0.01 to reference policy. |
| Optimizer | AdamW, lr=1e-6, constant. |
| Steps | 2,000 optimisation steps. |
| Compute | ≈ 320 GPU-hours on 4× H100. |
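The binary reward in the table above reduces to answer parsing plus exact match. A minimal sketch, assuming an "Answer: &lt;letter&gt;" output convention (the tag and regex are illustrative, not the released harness):

```python
import re

# Assumed rollout format: the policy is prompted to end with
# "Answer: <letter>". Both the tag and this regex are illustrative.
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-E])\)?\s*$", re.MULTILINE)

def verifiable_reward(rollout_text, gold_key):
    """Binary RLVR reward: 1.0 iff the parsed final answer matches
    the gold key. Malformed rollouts (no parseable answer) score 0,
    which doubles as the format-compliance check."""
    m = ANSWER_RE.search(rollout_text)
    if m is None:
        return 0.0
    return 1.0 if m.group(1) == gold_key.upper() else 0.0

verifiable_reward("reasoning...\nAnswer: C", "C")    # 1.0
verifiable_reward("reasoning...\nAnswer: (B)", "C")  # 0.0
verifiable_reward("no final tag", "C")               # 0.0
```

These per-rollout rewards feed the group-relative advantage computation of § 2.1.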
3.4 Why Not DPO?
Rafailov et al.[10] propose DPO as a PPO-free alternative for preference-based fine-tuning. DPO is appropriate when the training signal is pairwise preference. For Reason·Med the training signal is verifiable correctness, not preference — a regime where GRPO outperforms DPO in published comparisons[7]. DPO is therefore not used in Reason·Med; it is, however, the algorithm of choice for Project 10 (Conscience) where the signal is preference.
§ 4 Evaluation Protocol
| Benchmark | Test items | Format | Baseline (MedGemma-4B) |
|---|---|---|---|
| MedQA[14] | 1,273 | 5-way MCQ | 64.4% (4B) / 87.7% (27B-text) |
| MedMCQA[15] | 6,150 | 4-way MCQ | Report per MedGemma TR |
| PubMedQA[16] | 1,000 | Yes/No/Maybe | Report per MedGemma TR |
| MultiMedQA[11] human eval | 50 | Open-ended | Med-PaLM-baseline |
Baselines are evaluated under the same harness (the EleutherAI lm-evaluation-harness fork with the medical-benchmark plugins) for a fair comparison.
§ 5 Expected Contributions
- Model. Open weights (LoRA adapters and merged full-precision) of an 8B clinical reasoning model that outperforms MedGemma-4B on MedQA.
- Recipe. A reproducible, hyperparameter-documented three-stage post-training procedure that other researchers can replicate or apply to other base models.
- Data manifest. A frozen list of training sources with checksums, enabling exact recipe replication.
§ 6 Limitations and Risks
An 8B model fine-tuned on three multiple-choice benchmarks will overperform on the benchmark distribution and under-deliver on tasks it never trained on. Reason·Med is not an end-to-end clinical assistant; it is a reasoning model whose strengths are concentrated where the training signal lived. The MultiMedQA human-evaluation results are the primary out-of-distribution check, but they are themselves a small sample. A v0.2 effort should extend evaluation to JAMA Clinical Challenge and to clinician-rated free-form clinical questions.
A second concern is benchmark saturation. MedQA[14] is approaching ceiling for the strongest closed models; further gains on it are not necessarily a measure of broader clinical capability. The recipe transferability — whether the same three-stage procedure works equally well on Llama 4 or Qwen 4 once released — is the harder, more interesting question and the target of follow-on work.
§ 7 Conclusion
Reason·Med is a transparency artifact as much as a model artifact. The point is not to claim a new architectural insight; the point is to demonstrate that the dominant 2025 reasoning-RL recipe[1][2][7] applies cleanly to the medical domain, that an 8B open model can match or exceed a comparable closed-data baseline[5], and that the full pipeline can be reproduced end-to-end on consumer-rentable compute. Each of those claims is testable; collectively, they are the contribution.
References
- Shao Z, Wang P, Zhu Q, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024. arxiv.org/abs/2402.03300
- DeepSeek-AI, Guo D, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 2025. arxiv.org/abs/2501.12948
- Yang A, et al. (Qwen Team). Qwen3 Technical Report. 2025. arxiv.org/abs/2505.09388
- Yang A, et al. (Qwen Team). Qwen2.5 Technical Report. 2024. arxiv.org/abs/2412.15115
- Sellergren A, et al. (Google DeepMind). MedGemma Technical Report. 2025. arxiv.org/abs/2507.05201
- Gururangan S, Marasović A, Swayamdipta S, et al. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL, 2020. arxiv.org/abs/2004.10964
- Lambert N, Morrison J, Pyatkin V, et al. (AI2). Tulu 3: Pushing Frontiers in Open Language Model Post-Training. 2024. arxiv.org/abs/2411.15124
- Hu EJ, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arxiv.org/abs/2106.09685
- Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS, 2023. arxiv.org/abs/2305.14314
- Rafailov R, Sharma A, Mitchell E, et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS, 2023. arxiv.org/abs/2305.18290
- Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge (Med-PaLM, MultiMedQA). Nature, 2023. nature.com/articles/s41586-023-06291-2
- Singhal K, Tu T, Gottweis J, et al. Toward Expert-Level Medical Question Answering with LLMs (Med-PaLM 2). 2023. arxiv.org/abs/2305.09617
- Chen Z, Cano AH, Romanou A, et al. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. 2023. arxiv.org/abs/2311.16079
- Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What Disease does this Patient Have? (MedQA). 2020. arxiv.org/abs/2009.13081
- Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain QA. CHIL, 2022. arxiv.org/abs/2203.14371
- Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: A Dataset for Biomedical Research Question Answering. EMNLP-IJCNLP, 2019. arxiv.org/abs/1909.06146
- Chen J, et al. HuatuoGPT-II: One-stage Training for Medical Adaption of LLMs. COLM, 2024. Beat GPT-4 with 38% win / 38% tie / 24% loss in expert evaluation on Chinese medical tasks. arxiv.org/abs/2311.09774
- Chen J, et al. HuatuoGPT-o1: Towards Medical Complex Reasoning with LLMs. 2024. 8B model trained with verifier + RL on 40,000 verifiable medical problems; +8.5 points on medical benchmarks — closest architectural precedent for Reason·Med. arxiv.org/abs/2412.18925
- Labrak Y, Bazoge A, Morin E, Gourraud PA, Rouvier M, Dufour R. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. ACL, 2024. Mistral-7B continued-pretrained on PubMed Central; first multilingual medical-LLM evaluation across 7 languages. arxiv.org/abs/2402.10373
- Kwon W, Li Z, Zhuang S, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP, 2023. 2–4× throughput improvement vs FasterTransformer/Orca; <4% KV-cache waste — standard inference engine for GRPO rollouts. arxiv.org/abs/2309.06180
- Stabilizing Reasoning in Medical LLMs with Continued Pretraining and RPO. arXiv preprint 2504.18080, April 2025. Combines DAPT with reasoning-preference optimization on clinical tasks — directly aligned with Reason·Med's pipeline. arxiv.org/abs/2504.18080
© Takeoff AI · Set in EB Garamond