Caliper: A FHIR-Grounded Extension of HealthBench for Patient-Record Reasoning
Five hundred clinical tasks scored against an actual patient record, not a paragraph of prose.
Abstract
HealthBench[1] established a standard for evaluating medical chat through physician-rubric-graded conversations, but its tasks are not grounded in a patient's actual structured record. We propose Caliper, a public benchmark of approximately five hundred tasks, each of which pairs a FHIR R4 bundle with a clinical question and a physician-style rubric. Tasks span medication reconciliation, abnormal-lab triage, problem-list reasoning, longitudinal trend detection, and adverse-event identification. We adopt a panel-of-judges scoring methodology[9] grounded in the LLM-as-judge literature[8], sourcing bundles from de-identified MIMIC-IV[6] and from Synthea[7]. Caliper differs from closed-form medical question benchmarks[3][4][5] by requiring grounded reasoning over a real FHIR resource graph, and from EHR few-shot benchmarks[10] by using open-ended rubric-scored generation rather than fixed prediction heads. The release target is a public leaderboard, an open dataset, and a peer-reviewed manuscript.
§ 1 Introduction
In May 2025, OpenAI released HealthBench[1], a 5,000-conversation evaluation in which physician-authored rubrics grade open-ended model responses. HealthBench moved medical model evaluation past multiple-choice and into the rubric-graded regime — a meaningful advance over MedQA[3], MedMCQA[4], and PubMedQA[5]. What HealthBench does not evaluate, however, is reasoning grounded in a patient's actual structured record. Its tasks consist of free-text scenarios; they do not require the model to navigate a FHIR resource graph the way a downstream clinical deployment would.
This is the gap Caliper fills. Caliper preserves HealthBench's rubric-graded methodology and extends it with two structural changes: every task is anchored to a FHIR R4[11] bundle, and scoring is performed by a panel of three diverse judge models[9] rather than a single arbiter. The result is a benchmark that tests the form of clinical reasoning a deployed system actually performs.
1.1 Contributions
- A public dataset of approximately five hundred FHIR-grounded clinical tasks, each with a deterministic gold answer and an open-ended physician-style rubric.
- A reference scoring harness that implements panel-of-judges evaluation[9] with audit traces, mitigating known LLM-as-judge biases[8].
- A public leaderboard with cross-model results (target: at least five frontier models) and the first quantitative measure of frontier-model spread on FHIR-grounded reasoning.
§ 2 Background and Related Work
2.1 The Medical LLM Benchmark Lineage
Medical LLM evaluation has passed through three generations. The first — MedQA[3], MedMCQA[4], PubMedQA[5] — consists of multiple-choice questions adapted from licensing exams or curated from biomedical literature. The second, exemplified by Med-PaLM's MultiMedQA[2], combined multiple-choice with open-ended human-evaluation panels along axes such as factuality, possible harm, and bias. The third — HealthBench[1] — formalised rubric-graded scoring at scale.
All three generations share an evaluation surface that is essentially a text vignette. None of them grade a model's ability to navigate a structured patient record. EHRSHOT[10] partially addresses this gap by evaluating foundation models on few-shot prediction tasks against longitudinal EHRs, but its task formulation reduces to fixed prediction heads rather than open-ended reasoning. Caliper occupies the previously empty intersection: open-ended, rubric-graded, FHIR-grounded.
2.2 LLM-as-Judge and Its Pitfalls
Zheng et al.[8] demonstrate that strong LLMs reach high agreement with human evaluators on open-ended responses but exhibit characteristic biases: position bias, verbosity bias, and self-enhancement bias when scoring outputs from models in their own family. These biases compromise single-judge evaluations of the kind HealthBench's default protocol uses. Verga et al.[9] show that replacing a single GPT-4 judge with a Panel of LLM Evaluators (PoLL) — three diverse smaller models — yields higher correlation with human ratings at lower cost. Caliper adopts the PoLL protocol for its scoring harness; the panel configuration is described in 3.3.
2.3 Patient-Record Sources
Caliper's bundles come from two sources. The de-identified MIMIC-IV[6] ICU dataset, mapped to FHIR R4[11], supplies real clinical complexity for tasks where it matters — longitudinal trends, adverse-event identification. Synthea[7] supplies a clean synthetic backbone for tasks where PHI exposure would otherwise be problematic, and supports scaling Caliper to roughly five hundred tasks without DUA bottlenecks.
§ 3 Proposed Approach
3.1 Task Schema
Each Caliper task is a tuple (bundle, prompt, rubric, gold):
- bundle — a FHIR R4 Bundle resource serialised as JSON, sized between 5 KB and 80 KB. Bundles contain the minimum resource set required to answer the prompt plus distractor resources.
- prompt — a clinically plausible question phrased as a clinician might phrase it. ("Has this patient ever had an HbA1c above 9.0%, and if so, what was the most recent value?")
- rubric — a physician-style scoring rubric in the HealthBench format: a list of criteria each with a point weight, capturing partial credit and disqualifying errors.
- gold — a deterministic gold answer used for cross-validation against the rubric.
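To make the schema concrete, the sketch below shows one way the task record could be represented inside the scoring harness. The class and field names are illustrative assumptions; only the four tuple elements, the category and tier labels, and the HealthBench-style rubric shape (weighted criteria, with negative weights for disqualifying errors) come from the text above.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    text: str       # what the response must (or must not) contain
    points: float   # point weight; negative for disqualifying errors

@dataclass
class CaliperTask:
    task_id: str
    category: str                  # one of the five categories in Table 1
    tier: int                      # 1, 2, or 3: resource-span of the gold answer
    bundle: dict                   # FHIR R4 Bundle resource, parsed from JSON (5-80 KB)
    prompt: str                    # clinician-phrased question
    rubric: list[RubricCriterion]  # physician-style rubric in HealthBench format
    gold: str                      # deterministic gold answer for cross-validation

# Hypothetical criterion for the HbA1c prompt above:
# RubricCriterion("States the most recent HbA1c value above 9.0% with its date", 3.0)
```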
3.2 Task Categories and Distribution
Table 1 gives the planned distribution of tasks across categories, sources, and complexity tiers.
| Category | Tasks | Source | Tier |
|---|---|---|---|
| Medication reconciliation | 110 | MIMIC-IV + Synthea | II |
| Abnormal-lab triage | 100 | MIMIC-IV | I |
| Problem-list reasoning | 90 | Synthea | II |
| Longitudinal trend detection | 110 | MIMIC-IV + Synthea | III |
| Adverse-event identification | 90 | MIMIC-IV | III |
Complexity tiers indicate the resource-span of the gold answer: Tier I tasks require evidence from a single resource type; Tier II from two; Tier III from three or more with temporal reasoning. The tier mix is deliberate — over-representation of single-resource tasks is what makes existing benchmarks easy.
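Bundle construction is where the tier definitions become concrete: a task's bundle must contain the resources its gold answer spans, plus distractors, as described in the task schema above. The sketch below is a minimal illustration of that trimming step, assuming a per-patient FHIR R4 JSON export (as Synthea produces) and a fixed distractor cap; neither the function name nor the cap is a specified part of the curation pipeline.

```python
import json
import random

def trim_bundle(path, needed_types, distractor_cap=10, seed=0):
    """Load a per-patient FHIR R4 Bundle and keep only the entries whose
    resourceType the task requires, plus a capped random sample of
    distractor resources drawn from the remaining entries."""
    with open(path) as f:
        bundle = json.load(f)

    needed, other = [], []
    for entry in bundle.get("entry", []):
        rtype = entry.get("resource", {}).get("resourceType")
        (needed if rtype in needed_types else other).append(entry)

    random.seed(seed)
    distractors = random.sample(other, min(distractor_cap, len(other)))

    return {
        "resourceType": "Bundle",
        "type": bundle.get("type", "collection"),
        "entry": needed + distractors,
    }

# A Tier I abnormal-lab task, for example, might need only:
# task_bundle = trim_bundle("synthea_patient.json", {"Patient", "Observation"})
```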
How Caliper inherits HealthBench's rubric design at scale
HealthBench[1] ships with 5,000 conversations and 48,562 unique rubric criteria written by 262 physician annotators spanning 26 specialties and 49 languages; GPT-4.1 is its default grader. Caliper preserves the rubric-scoring shape but reduces the conversation surface and adds the FHIR bundle as the new variable.
3.3 Panel-of-Judges Scoring
Each model response is scored independently by three judges drawn from disjoint model families (e.g., Claude Opus 4.7, GPT-5, Gemini 2.0 Pro). Each judge applies the task's rubric, returning per-criterion ratings and a total score. The final score is the median of the three judge totals; disagreement above a fixed threshold flags the task for human review. The panel-of-judges design follows Verga et al.[9] directly.
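A minimal sketch of the aggregation step, assuming each judge returns a rubric total already normalised to the 0–1 scale reported in § 4; the 0.2 disagreement threshold is a placeholder, not a decided parameter.

```python
from statistics import median

def aggregate_panel(judge_scores: dict[str, float], threshold: float = 0.2) -> dict:
    """Combine three independent judge totals into a final task score.

    judge_scores maps judge-model name -> rubric total on a 0-1 scale.
    The final score is the median; if the spread between the highest and
    lowest judge exceeds the threshold, the task is flagged for human review.
    """
    scores = list(judge_scores.values())
    assert len(scores) == 3, "Caliper uses a panel of exactly three judges"
    spread = max(scores) - min(scores)
    return {
        "score": median(scores),
        "needs_human_review": spread > threshold,
        "judge_scores": judge_scores,  # retained for the audit trace
    }

# aggregate_panel({"judge_a": 0.82, "judge_b": 0.78, "judge_c": 0.40})
# -> {'score': 0.78, 'needs_human_review': True, ...}
```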
3.4 Calibration Set
A 50-task calibration subset is human-graded by two physician reviewers. The calibration set is used to (i) validate the rubric quality, (ii) measure judge-human agreement (target: Cohen's κ ≥ 0.70), and (iii) audit judge bias periodically as new models join the leaderboard.
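One way the κ target could be audited, sketched below under the assumption that each rubric criterion yields a binary met / not-met decision that can be paired between the judge panel and a physician reviewer across the calibration subset.

```python
def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters making binary (0/1) decisions on the
    same items, e.g. per-criterion met/not-met judgments on the 50-task
    calibration subset."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:  # degenerate case: chance agreement is already perfect
        return 1.0
    return (observed - expected) / (1 - expected)

# Target: cohens_kappa(panel_decisions, physician_decisions) >= 0.70
```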
§ 4 Evaluation Protocol
For each scored model, we report:
- Overall score — mean per-task rubric score across all 500 tasks, on a 0–1 scale.
- Per-category breakdowns for the five task categories listed in Table 1.
- Per-tier breakdowns distinguishing single-resource, two-resource, and three-plus-resource synthesis.
- Fact-invention rate — the proportion of responses asserting a specific clinical claim not retrievable from the FHIR bundle (analogous to Atrium's grounding-fidelity metric).
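The sketch below shows how these reported numbers could be rolled up from per-task results; the record fields (category, tier, score, invented_fact) are assumed names for the harness's internal format, mirroring the metrics above.

```python
from collections import defaultdict
from statistics import mean

def summarise(results: list[dict]) -> dict:
    """Roll per-task results up into the reported metrics.

    Each result record is assumed to carry: 'category', 'tier', 'score'
    (the 0-1 rubric score from the judge panel), and 'invented_fact'
    (True if the response asserted a clinical detail absent from the bundle).
    """
    by_category, by_tier = defaultdict(list), defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["score"])
        by_tier[r["tier"]].append(r["score"])

    return {
        "overall": mean(r["score"] for r in results),
        "per_category": {c: mean(s) for c, s in by_category.items()},
        "per_tier": {t: mean(s) for t, s in by_tier.items()},
        "fact_invention_rate": mean(float(r["invented_fact"]) for r in results),
    }
```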
4.1 Comparison to Existing Benchmarks
Table 2 positions Caliper in the evaluation landscape:
| Benchmark | Format | Grounding | Scoring |
|---|---|---|---|
| MedQA[3] | MCQ | None | Exact match |
| MedMCQA[4] | MCQ | None | Exact match |
| PubMedQA[5] | Yes/No/Maybe | Literature abstract | Exact match |
| MultiMedQA[2] | MCQ + open | None | Human panel |
| HealthBench[1] | Open dialogue | Free-text scenario | Single judge |
| EHRSHOT[10] | Few-shot prediction | Longitudinal EHR | Fixed metrics |
| Caliper | Open response | FHIR R4 bundle | Panel of judges |
§ 5 Expected Contributions
- Dataset. Approximately five hundred publicly released FHIR-grounded clinical reasoning tasks with physician-style rubrics — the first benchmark of its kind.
- Methodology. A reference panel-of-judges scoring harness with measured judge-human agreement.
- Empirical findings. The first quantitative measure of frontier-model spread on FHIR-grounded reasoning across categories that matter clinically.
§ 6 Limitations and Risks
Caliper inherits the limitations of its data sources. MIMIC-IV[6] represents ICU cohorts and may underrepresent ambulatory complexity; Synthea[7] is realistic but lacks real EHR noise. The panel-of-judges design assumes the panel's biases do not align — a failure mode noted by Zheng et al.[8] — which is precisely why we calibrate against human review on a 50-task subset rather than trusting the panel blindly.
A separate risk is benchmark gaming. If Caliper becomes influential, models will be tuned against it, and the benchmark loses signal. The mitigation is a held-out v0.2 expansion: a 100-task private set, periodically rotated, against which leaderboard entries are spot-checked.
§ 7 Conclusion
Caliper takes HealthBench's[1] rubric-graded methodology and plugs it into the data substrate clinical deployments actually use. By scoring frontier models against FHIR R4[11] resource graphs with a bias-corrected panel of judges[9], Caliper produces the first benchmark whose results have clear deployment-readiness implications. The expected outcome is a public leaderboard that frontier-model teams can publish against and clinical AI startups can use to ground their procurement decisions.
References
[1] Arora R, et al. (OpenAI). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv preprint, 2025. arxiv.org/abs/2505.08775
[2] Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature, 620:172–180, 2023. nature.com/articles/s41586-023-06291-2
[3] Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams (MedQA). arXiv preprint, 2020. arxiv.org/abs/2009.13081
[4] Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. Proceedings of CHIL, 2022. proceedings.mlr.press/v174/pal22a.html
[5] Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: A Dataset for Biomedical Research Question Answering. EMNLP-IJCNLP, 2019. aclanthology.org/D19-1259
[6] Johnson AEW, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10:1, 2023. nature.com/articles/s41597-022-01899-x
[7] Walonoski J, Kramer M, Nichols J, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients. Journal of the American Medical Informatics Association, 25(3):230–238, 2018. academic.oup.com/jamia/article/25/3/230/4098271
[8] Zheng L, Chiang WL, Sheng Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2306.05685
[9] Verga P, et al. (Cohere). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv preprint, 2024. arxiv.org/abs/2404.18796
[10] Wornow M, Thapa R, Steinberg E, Fries J, Shah N. EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2307.02028
[11] HL7 International. HL7 FHIR Release 4 (R4) Specification, v4.0.1. Official HL7 standard, 2019. hl7.org/fhir/R4