Paper 09 / 10 Preliminary Manuscript · v0.1 May 2026
Dossier №01 · Project 09 · Reason·Med

Reason·Med: An Open Clinical Reasoning Model via Continued Pretraining, SFT, and GRPO with Verifiable Rewards

Three stages, one base model (Qwen3-8B), four medical question-answering benchmarks, full training transparency. The first open clinical R-model with a reproducible recipe.

Abstract

Two open developments in 2024–2025 reset the landscape for medical reasoning models: the GRPO algorithm introduced in DeepSeekMath[1] and applied at scale in DeepSeek-R1[2], and the release of MedGemma[5] as a strong open medical foundation. MedGemma's training pipeline, however, is not fully reproducible from public materials; DeepSeek-R1 has not been domain-adapted to medicine. Reason·Med closes both gaps by applying a transparent three-stage post-training procedure to Qwen3-8B[3]: (i) continued pretraining[6] on a curated PubMed and clinical-guideline corpus; (ii) supervised fine-tuning on reasoning traces distilled from a stronger teacher; (iii) GRPO[1] with verifiable rewards[7] on MedQA[14], MedMCQA[15], and PubMedQA[16]. The published MedGemma family[5] ships at 4B (text+vision, 64.4% on MedQA) and 27B-text (87.7% on MedQA, reported "within 3 points of DeepSeek-R1 at approximately one-tenth the inference cost"). Reason·Med v0.1 targets exceeding the MedGemma-4B bar by ≥ 3 points on MedQA at the same parameter scale band, with the entire data manifest, training scripts, and intermediate checkpoints released; the v0.2 stretch goal is to approach the MedGemma-27B-text bar at a 3.4× smaller parameter count.

§ 1 Introduction

The dominant 2024–2025 advance in reasoning models has been reinforcement learning with verifiable rewards (RLVR), formalised in Tulu 3[7] and demonstrated at scale in DeepSeek-R1[2]. RLVR is uniquely well-suited to medical multiple-choice benchmarks because the reward (whether the answer matches the gold key) is unambiguously verifiable. The combination of GRPO[1] and RLVR is arguably the most compute-efficient route to clinical reasoning capability demonstrated to date.

Reason·Med exists because no open clinical R-model has been released with the full training pipeline public. MedGemma[5] is a strong open foundation but its training data is partially aggregated and the recipe is not exactly reproducible. Meditron-70B[13] is the strongest open clinical pretrained model but predates the reasoning-RL wave. Med-PaLM 2[12] is closed-weight. Reason·Med is the open-weight, fully-reproducible counterpart.

1.1 Contributions

  1. An open-weight clinical reasoning model built on Qwen3-8B[3], released with full LoRA[8] adapter weights, training logs, and a fixed data manifest.
  2. A reproducible three-stage post-training recipe: continued pretraining[6] → reasoning-trace SFT → GRPO[1] with verifiable rewards[7].
  3. Head-to-head evaluation against MedGemma[5], Meditron[13], and base Qwen3-8B on MedQA[14], MedMCQA[15], PubMedQA[16], and the MultiMedQA[11] human-evaluation axes.

§ 2 Background and Related Work

2.1 GRPO and Verifiable Rewards

Shao et al.'s DeepSeekMath[1] introduced Group Relative Policy Optimization, a PPO variant that estimates the baseline within a sampled group rather than via a separate value model, and demonstrated it on competition mathematics. DeepSeek-R1[2] applied GRPO at scale with binary verifiable rewards (the answer matches the gold key or it does not) and showed that the resulting policy exhibits emergent long-form reasoning even from a pretrained-only base. Lambert et al.[7] generalised the RLVR pattern in Tulu 3 and provided the recipe template that Reason·Med follows for Stage 3.
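
A minimal sketch of the group-relative baseline (illustrative, not the DeepSeekMath reference implementation): sample G completions per prompt, score each, and standardise each reward against its own group's statistics in place of a learned value function.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Compute GRPO-style advantages for one prompt's group of rollouts.

    rewards: shape (G,), one scalar reward per sampled completion.
    Standardising by the group mean and std makes the group itself
    the baseline, so no critic network is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: G=8 rollouts with binary verifiable rewards (3 of 8 correct).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # correct rollouts get positive advantage
```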

2.2 Continued Pretraining for Domain Adaptation

Gururangan et al.[6] (ACL 2020) established that domain-adaptive pretraining yields consistent task gains even when starting from a strong general model — a result confirmed across multiple subsequent studies. For Reason·Med, continued pretraining on PubMed abstracts and ASPEN-style clinical guidelines is intended primarily to refresh medical terminology and clinical conventions in the Qwen3 base, not to substantially shift its reasoning capability.

2.3 Parameter-Efficient Fine-Tuning

LoRA[8] and its 4-bit variant QLoRA[9] are the standard parameter-efficient fine-tuning methods. Reason·Med uses LoRA adapters throughout (continued pretraining, SFT, and GRPO) to keep the training reproducible on consumer-grade GPU rentals. Adapter weights are released, along with merged full-precision weights for inference convenience.

2.4 Existing Medical LLMs

Med-PaLM[11] established MultiMedQA and the per-axis human evaluation. Med-PaLM 2[12] reached 86.5% on MedQA via ensemble refinement — the published closed-weight state of the art at scale. Meditron-70B[13] is the strongest open continued-pretrained model. MedGemma[5] released open weights with explicit medical specialisation: MedGemma-4B reaches 64.4% on MedQA and MedGemma-27B-text reaches 87.7%, reported "within 3 points of DeepSeek-R1 at ~1/10 the inference cost." HuatuoGPT-II[17] demonstrated a one-stage medical adaptation recipe beating GPT-4 with a 38% win / 38% tie / 24% loss rate on Chinese medical expert evaluation. The closest architectural prior is HuatuoGPT-o1[18] (Chen et al., Dec 2024): an 8B medical model trained with verifier + RL on only 40,000 verifiable medical problems, reporting +8.5 points on medical benchmarks. Reason·Med's recipe is most directly comparable to HuatuoGPT-o1; the differentiation is full data-and-recipe transparency (HuatuoGPT-o1 releases weights but its training data manifest is partial) and the explicit GRPO-with-Tulu-3-RLVR pipeline rather than the bespoke verifier setup. BioMistral[19] remains the strongest multilingual open medical baseline (Mistral-7B continued-pretrained on PubMed Central). Reason·Med's target is to beat MedGemma-4B on MedQA by at least 3 points; matching HuatuoGPT-o1's +8.5-point delta is the stretch goal.

2.5 Evaluation Datasets

MedQA[14] is the USMLE-style benchmark — 1,273 test items, five-way multiple choice. MedMCQA[15] is the Indian-medical-exam-derived benchmark — 6,150 test items, four-way multiple choice. PubMedQA[16] is the biomedical abstract-grounded yes/no/maybe benchmark — 1,000 expert-labelled items. These three constitute the canonical evaluation triple for medical LLMs and serve as both Stage-2 reasoning-trace sources and Stage-3 RLVR signals.

§ 3 Proposed Approach

Figure 1 · Three-stage training pipeline
[Figure 1: three-stage pipeline diagram. Qwen3-8B base (8B params) → Stage 1 CPT: continued pretraining on PubMed (~2B tok), guidelines (~1B), StatPearls (~2B); LoRA r=64; ~200 H100h → Stage 2 SFT: ~40k reasoning traces distilled (Claude/R1), temperature 0; 3 epochs; ~60 H100h → Stage 3 GRPO: verifiable RL, G=8 rollouts/prompt, binary match reward, β=0.01 KL, lr=1e-6; 2k steps; ~320 H100h → Reason·Med: HF weights, LoRA + merged; target ≥ MedGemma-4B (64.4%) + 3 pts on MedQA.]
Figure 1. Reason·Med applies three sequential post-training stages to Qwen3-8B. Stage 1 is domain-adaptive continued pretraining[6] on ~5B medical tokens. Stage 2 distils reasoning traces from a stronger teacher into SFT data, following the DeepSeek-R1[2] recipe. Stage 3 applies GRPO[1] with binary verifiable rewards in the Tulu 3 RLVR[7] formulation: GRPO drops PPO's critic network and estimates the baseline within an in-batch group of rollouts per prompt. The original DeepSeekMath formulation used G=64; Reason·Med uses G=8 as a compute-economical reduction validated on smaller medical-MCQA settings. DeepSeek-R1's RL stage alone used 512 H800 GPUs × 80 hours (≈41K H800-hours, ~$80K at typical rates); Reason·Med's full pipeline is ~580 H100-hours (~$1.5–2K at Vast.ai marketplace rates of $1.73–$1.87/GPU-hour). The MedGemma-4B target is 64.4% on MedQA; MedGemma-27B-text reaches 87.7%, within 3 points of DeepSeek-R1 at ~1/10 the inference cost.

3.1 Stage 1 — Continued Pretraining

Table 1. Stage 1 configuration.
Item | Value
Base model | Qwen3-8B[3]
Corpus | 5B tokens: PubMed abstracts (≈2B), ASPEN / NICE / AHA guidelines (≈1B), MedlinePlus + StatPearls (≈2B)
Adapter | LoRA[8], r=64, α=128, dropout=0.05, target_modules=q,k,v,o,gate,up,down
Optimizer | AdamW, lr=5e-5, cosine to 0, warmup=0.03
Batch | 512 tokens × 32 sequences × 4-step accumulation
Compute | ≈200 GPU-hours on 4× H100 (RunPod / Modal)
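
The Table 1 adapter settings map directly onto a peft configuration. A minimal sketch, assuming the Hugging Face peft and transformers libraries and expanding the abbreviated module names to Qwen3's projection-layer names (q_proj, k_proj, etc.):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")

# Stage 1 adapter per Table 1: r=64, alpha=128, dropout=0.05,
# applied to all attention and MLP projections.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```

Per Tables 2 and 3, the same adapter is carried through Stages 2 and 3 with no rank change.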

3.2 Stage 2 — Reasoning-Trace SFT

Reasoning traces are distilled from a teacher (default: Claude Opus 4.7) prompted to solve MedQA[14], MedMCQA[15], and PubMedQA[16] training-split items with explicit chain-of-thought. Each training example is the tuple (question, teacher_trace, answer). Traces with incorrect final answers are discarded. Approximately 40,000 traces survive filtering after teacher temperature-0 sampling on the full training splits.
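
A sketch of the rejection filter described above; the "Final answer: X" tag and the field names are illustrative assumptions, not the released pipeline's conventions.

```python
import re

ANSWER_RE = re.compile(r"Final answer:\s*([A-E])\b")

def parse_final_answer(trace: str) -> str | None:
    """Extract the final answer letter from a teacher trace, assuming
    the teacher is prompted to end with 'Final answer: <letter>'."""
    m = ANSWER_RE.search(trace)
    return m.group(1) if m else None

def filter_traces(examples: list[dict]) -> list[dict]:
    """Keep only (question, teacher_trace, answer) tuples whose parsed
    final answer matches the gold key; the rest are discarded."""
    return [
        ex for ex in examples
        if parse_final_answer(ex["teacher_trace"]) == ex["answer"]
    ]
```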

Table 2. Stage 2 configuration.
Item | Value
Method | SFT on full traces; loss masked to the teacher-trace + final-answer span
Adapter | Stage 1 LoRA continued, no rank change
Optimizer | AdamW, lr=2e-5, cosine, warmup=0.02
Epochs | 3 (early stopping on a held-out 1,000-item subset)
Compute | ≈60 GPU-hours on 4× H100
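
The loss masking in Table 2 corresponds to the standard Hugging Face convention of setting ignored token labels to -100, so that cross-entropy is computed only over the teacher-trace and final-answer spans. A minimal sketch, assuming that convention:

```python
def build_sft_labels(tokenizer, question: str, trace_and_answer: str) -> dict:
    """Tokenise (question, trace+answer) and mask the question span.

    Labels of -100 are ignored by PyTorch cross-entropy, so the loss
    covers only the teacher trace and final answer (Table 2).
    """
    prompt_ids = tokenizer(question, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(trace_and_answer, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + list(target_ids)
    return {"input_ids": input_ids, "labels": labels}
```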

3.3 Stage 3 — GRPO with Verifiable Rewards

Stage 3 applies GRPO[1] against the MedQA/MedMCQA training splits using a binary verifiable reward: 1 if the parsed final answer matches the gold key, 0 otherwise. The group size G is 8; the policy is initialised from the Stage 2 LoRA-merged checkpoint, which also serves as the frozen reference policy. KL divergence to the reference is regularised with β=0.01 to prevent reward hacking via length blow-up, a failure mode well documented in the preference-optimisation literature[10]; the choice of GRPO over PPO itself follows from dropping the value model (§ 2.1).
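
A sketch of the reward as a function of a single rollout, with the format check folded into the parse. The answer-tag regex is an assumption; Table 3 notes only that format compliance is enforced via regex.

```python
import re

ANSWER_RE = re.compile(r"Final answer:\s*([A-E])\b")

def verifiable_reward(completion: str, gold_key: str) -> float:
    """Binary RLVR reward: 1.0 iff the rollout contains a well-formed
    final answer that matches the gold key, else 0.0. A malformed or
    missing answer tag earns 0, so the regex doubles as the format check."""
    m = ANSWER_RE.search(completion)
    return 1.0 if m is not None and m.group(1) == gold_key else 0.0
```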

Table 3. Stage 3 configuration.
Item | Value
Algorithm | GRPO[1] per the Tulu 3[7] RLVR recipe
Reward | Binary exact-match on gold key; format compliance enforced via regex
Group size | G=8 rollouts per prompt
KL coefficient | β=0.01 to reference policy
Optimizer | AdamW, lr=1e-6, constant
Steps | 2,000 optimisation steps
Compute | ≈320 GPU-hours on 4× H100
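
Wired end to end, Stage 3 is a few lines in an off-the-shelf GRPO implementation. The manuscript does not name its RL framework, so the following sketch assumes TRL's GRPOTrainer; the checkpoint path and prompt format are hypothetical.

```python
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

ANSWER_RE = re.compile(r"Final answer:\s*([A-E])\b")

def reward_exact_match(completions, answer, **kwargs):
    """TRL-style reward function: one float per completion.
    `answer` is the gold-key column forwarded from the dataset."""
    return [
        1.0 if (m := ANSWER_RE.search(c)) and m.group(1) == gold else 0.0
        for c, gold in zip(completions, answer)
    ]

# Illustrative one-item dataset; the real run uses the MedQA/MedMCQA
# training splits formatted the same way.
train_ds = Dataset.from_list([
    {"prompt": "A 54-year-old ... Which drug? End with 'Final answer: <letter>'.",
     "answer": "C"},
])

config = GRPOConfig(
    output_dir="reasonmed-stage3",
    num_generations=8,            # G=8 rollouts per prompt (Table 3)
    beta=0.01,                    # KL coefficient to the reference policy
    learning_rate=1e-6,
    lr_scheduler_type="constant",
    max_steps=2000,
)
trainer = GRPOTrainer(
    model="path/to/stage2-merged",  # hypothetical Stage 2 checkpoint path
    reward_funcs=reward_exact_match,
    args=config,
    train_dataset=train_ds,
)
trainer.train()
```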

3.4 Why Not DPO?

Rafailov et al.[10] propose DPO as a PPO-free alternative for preference-based fine-tuning. DPO is appropriate when the training signal is pairwise preference. For Reason·Med the training signal is verifiable correctness, not preference — a regime where GRPO outperforms DPO in published comparisons[7]. DPO is therefore not used in Reason·Med; it is, however, the algorithm of choice for Project 10 (Conscience) where the signal is preference.
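
The data requirement is visible directly in the DPO objective from Rafailov et al.[10]: each example must supply a preferred completion y_w and a dispreferred completion y_l for the same prompt x, a pairwise structure that a binary correct/incorrect signal does not naturally provide.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```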

§ 4 Evaluation Protocol

Table 4. Reason·Med evaluation suite.
Benchmark | Test items | Format | Baseline (MedGemma-4B)
MedQA[14] | 1,273 | 5-way MCQ | 64.4% (4B) / 87.7% (27B-text)
MedMCQA[15] | 6,150 | 4-way MCQ | Report per MedGemma TR
PubMedQA[16] | 1,000 | Yes/No/Maybe | Report per MedGemma TR
MultiMedQA[11] human eval | 50 | Open-ended | Med-PaLM baseline

Baselines are evaluated under the same harness (the EleutherAI lm-evaluation-harness fork with the medical-benchmark plugins) for a fair comparison.
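
A sketch of the harness invocation via the upstream Python API. The task names (medqa_4options, medmcqa, pubmedqa) are the upstream registry names and stand in for whatever the medical-benchmark fork registers; the checkpoint path is hypothetical.

```python
import lm_eval

# Evaluate the merged Reason·Med checkpoint and each baseline under
# identical harness settings for a like-for-like comparison (§ 4).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/reasonmed-merged,dtype=bfloat16",
    tasks=["medqa_4options", "medmcqa", "pubmedqa"],
    num_fewshot=0,
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```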

Pass criterion. Reason·Med v0.1 succeeds if (a) MedQA accuracy exceeds MedGemma-4B by ≥ 3 points, (b) MedMCQA and PubMedQA accuracies are within 2 points of MedGemma-4B, and (c) the full training recipe is reproducible from the released manifest and scripts to within 1 absolute MedQA point.

§ 5 Expected Contributions

  1. Model. Open weights (LoRA adapters and merged full-precision) of an 8B clinical reasoning model that outperforms MedGemma-4B on MedQA.
  2. Recipe. A reproducible, hyperparameter-documented three-stage post-training procedure that other researchers can replicate or apply to other base models.
  3. Data manifest. A frozen list of training sources with checksums, enabling exact recipe replication.

§ 6 Limitations and Risks

An 8B model fine-tuned on three multiple-choice benchmarks will overperform on the benchmark distribution and under-deliver on tasks it never trained on. Reason·Med is not an end-to-end clinical assistant; it is a reasoning model whose strengths are concentrated where the training signal lived. The MultiMedQA human-evaluation results are the primary out-of-distribution check, but they are themselves a small sample. A v0.2 effort should extend evaluation to JAMA Clinical Challenge and to clinician-rated free-form clinical questions.

A second concern is benchmark saturation. MedQA[14] is approaching ceiling for the strongest closed models; further gains on it are not necessarily a measure of broader clinical capability. Recipe transferability, i.e. whether the same three-stage procedure works equally well on Llama 4 or Qwen 4 once released, is the harder and more interesting question, and the target of follow-on work.

§ 7 Conclusion

Reason·Med is a transparency artifact as much as a model artifact. The point is not to claim a new architectural insight; the point is to demonstrate that the dominant 2025 reasoning-RL recipe[1][2][7] applies cleanly to the medical domain, that an 8B open model can match or exceed a comparable closed-data baseline[5], and that the full pipeline can be reproduced end-to-end on consumer-rentable compute. Each of those claims is testable; collectively, they are the contribution.

References

  1. Shao Z, Wang P, Zhu Q, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024. arxiv.org/abs/2402.03300
  2. DeepSeek-AI, Guo D, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 2025. arxiv.org/abs/2501.12948
  3. Yang A, et al. (Qwen Team). Qwen3 Technical Report. 2025. arxiv.org/abs/2505.09388
  4. Yang A, et al. (Qwen Team). Qwen2.5 Technical Report. 2024. arxiv.org/abs/2412.15115
  5. Sellergren A, et al. (Google DeepMind). MedGemma Technical Report. 2025. arxiv.org/abs/2507.05201
  6. Gururangan S, Marasović A, Swayamdipta S, et al. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. ACL, 2020. arxiv.org/abs/2004.10964
  7. Lambert N, Morrison J, Pyatkin V, et al. (AI2). Tulu 3: Pushing Frontiers in Open Language Model Post-Training. 2024. arxiv.org/abs/2411.15124
  8. Hu EJ, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022. arxiv.org/abs/2106.09685
  9. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS, 2023. arxiv.org/abs/2305.14314
  10. Rafailov R, Sharma A, Mitchell E, et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS, 2023. arxiv.org/abs/2305.18290
  11. Singhal K, Azizi S, Tu T, et al. Large Language Models Encode Clinical Knowledge (Med-PaLM, MultiMedQA). Nature, 2023. nature.com/articles/s41586-023-06291-2
  12. Singhal K, Tu T, Gottweis J, et al. Toward Expert-Level Medical Question Answering with LLMs (Med-PaLM 2). 2023. arxiv.org/abs/2305.09617
  13. Chen Z, Cano AH, Romanou A, et al. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. 2023. arxiv.org/abs/2311.16079
  14. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What Disease does this Patient Have? (MedQA). 2020. arxiv.org/abs/2009.13081
  15. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain QA. CHIL, 2022. arxiv.org/abs/2203.14371
  16. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: A Dataset for Biomedical Research Question Answering. EMNLP-IJCNLP, 2019. arxiv.org/abs/1909.06146
  17. Chen J, et al. HuatuoGPT-II: One-stage Training for Medical Adaption of LLMs. COLM, 2024. Beat GPT-4 with 38% win / 38% tie / 24% loss in expert evaluation on Chinese medical tasks. arxiv.org/abs/2311.09774
  18. Chen J, et al. HuatuoGPT-o1: Towards Medical Complex Reasoning with LLMs. 2024. 8B model trained with verifier + RL on 40,000 verifiable medical problems; +8.5 points on medical benchmarks — closest architectural precedent for Reason·Med. arxiv.org/abs/2412.18925
  19. Labrak Y, Bazoge A, Morin E, Gourraud PA, Rouvier M, Dufour R. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. ACL, 2024. Mistral-7B continued-pretrained on PubMed Central; first multilingual medical-LLM evaluation across 7 languages. arxiv.org/abs/2402.10373
  20. Kwon W, Li Z, Zhuang S, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). SOSP, 2023. 2–4× throughput improvement vs FasterTransformer/Orca; <4% KV-cache waste — standard inference engine for GRPO rollouts. arxiv.org/abs/2309.06180
  21. Stabilizing Reasoning in Medical LLMs with Continued Pretraining and RPO. arXiv preprint 2504.18080, April 2025. Combines DAPT with reasoning-preference optimization on clinical tasks — directly aligned with Reason·Med's pipeline. arxiv.org/abs/2504.18080
— · § · — Preliminary manuscript · Reason·Med v0.1 · Dossier №01
© Takeoff AI · Set in EB Garamond