Caliper: A FHIR-Grounded Extension of HealthBench for Patient-Record Reasoning
Five hundred clinical tasks scored against an actual patient record, not a paragraph of prose.
Abstract
HealthBench[1] established a standard for evaluating medical chat through physician-rubric-graded conversations, but its tasks are not grounded in a patient's actual structured record. We propose Caliper, a public benchmark of approximately five hundred tasks, each of which pairs a FHIR R4 bundle with a clinical question and a physician-style rubric. Tasks span medication reconciliation, abnormal-lab triage, problem-list reasoning, longitudinal trend detection, and adverse-event identification. We adopt a panel-of-judges scoring methodology[9] grounded in the LLM-as-judge literature[8], sourcing bundles from de-identified MIMIC-IV[6] and from Synthea[7]. Caliper differs from closed-form medical question benchmarks[3][4][5] by requiring grounded reasoning over a real FHIR resource graph, and from EHR few-shot benchmarks[10] by using open-ended rubric-scored generation rather than fixed prediction heads. The release target is a public leaderboard, an open dataset, and a peer-reviewed manuscript.
§ 1 Introduction
In May 2025, OpenAI released HealthBench[1], a 5,000-conversation evaluation in which physician-authored rubrics grade open-ended model responses. HealthBench moved medical model evaluation past multiple-choice and into the rubric-graded regime — a meaningful advance over MedQA[3], MedMCQA[4], and PubMedQA[5]. What HealthBench does not evaluate, however, is reasoning grounded in a patient's actual structured record. Its tasks consist of free-text scenarios; they do not require the model to navigate a FHIR resource graph the way a downstream clinical deployment would.
This is the gap Caliper fills. Caliper preserves HealthBench's rubric-graded methodology and extends it with two structural changes: every task is anchored to a FHIR R4[11] bundle, and scoring is performed by a panel of three diverse judge models[9] rather than a single arbiter. The result is a benchmark that tests the form of clinical reasoning a deployed system actually performs.
1.1 Contributions
- A public dataset of approximately five hundred FHIR-grounded clinical tasks, each with a deterministic gold answer and an open-ended physician-style rubric.
- A reference scoring harness that implements panel-of-judges evaluation[9] with audit traces, mitigating known LLM-as-judge biases[8].
- A public leaderboard with cross-model results (target: at least five frontier models) and the first quantitative measure of frontier-model spread on FHIR-grounded reasoning.
§ 2 Background and Related Work
2.1 The Medical LLM Benchmark Lineage
Medical LLM evaluation has passed through three generations. The first — MedQA[3], MedMCQA[4], PubMedQA[5] — consists of multiple-choice questions adapted from licensing exams or curated from biomedical literature. The second, exemplified by Med-PaLM's MultiMedQA[2], combined multiple-choice with open-ended human-evaluation panels along axes such as factuality, possible harm, and bias. The third — HealthBench[1] — formalised rubric-graded scoring at scale.
All three generations share an evaluation surface that is essentially a text vignette. None of them grade a model's ability to navigate a structured patient record. EHRSHOT[10] partially addresses this gap by evaluating foundation models on few-shot prediction tasks against longitudinal EHRs, but its task formulation reduces to fixed prediction heads rather than open-ended reasoning. Caliper occupies the previously empty intersection: open-ended, rubric-graded, FHIR-grounded.
2.2 LLM-as-Judge and Its Pitfalls
Zheng et al.[8] demonstrate that strong LLMs reach high agreement with human evaluators on open-ended responses but exhibit characteristic biases: position bias, verbosity bias, and self-enhancement bias when scoring outputs from models in their own family. These biases compromise single-judge evaluations of the kind HealthBench's default protocol uses. Verga et al.[9] show that replacing a single GPT-4 judge with a Panel of LLM Evaluators (PoLL) — three diverse smaller models — yields higher correlation with human ratings at lower cost. Caliper adopts the PoLL protocol for its scoring harness; the panel configuration is described in 3.3.
2.3 Patient-Record Sources
Caliper's bundles come from two sources. The de-identified MIMIC-IV[6] ICU dataset, mapped to FHIR R4[11], supplies real clinical complexity for tasks where it matters — longitudinal trends, adverse-event identification. Synthea[7] supplies a clean synthetic backbone for tasks where PHI exposure would otherwise be problematic, and supports scaling Caliper to roughly five hundred tasks without DUA bottlenecks.
§ 3 Proposed Approach
3.1 Task Schema
Each Caliper task is a tuple (bundle, prompt, rubric, gold):
- bundle — a FHIR R4 Bundle resource serialised as JSON, sized between 5 KB and 80 KB. Bundles contain the minimum resource set required to answer the prompt plus distractor resources.
- prompt — a clinically plausible question phrased as a clinician might phrase it. ("Has this patient ever had an HbA1c above 9.0%, and if so, what was the most recent value?")
- rubric — a physician-style scoring rubric in the HealthBench format: a list of criteria each with a point weight, capturing partial credit and disqualifying errors.
- gold — a deterministic gold answer used for cross-validation against the rubric.
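To make the schema concrete, the sketch below shows one way the task record could be represented inside the scoring harness. The class and field names are illustrative assumptions; only the four tuple elements, the category and tier labels, and the HealthBench-style rubric shape (weighted criteria, with negative weights for disqualifying errors) come from the text above.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    text: str       # what the response must (or must not) contain
    points: float   # point weight; negative for disqualifying errors

@dataclass
class CaliperTask:
    task_id: str
    category: str                  # one of the five categories in Table 1
    tier: int                      # 1, 2, or 3: resource-span of the gold answer
    bundle: dict                   # FHIR R4 Bundle resource, parsed from JSON (5-80 KB)
    prompt: str                    # clinician-phrased question
    rubric: list[RubricCriterion]  # physician-style rubric in HealthBench format
    gold: str                      # deterministic gold answer for cross-validation

# Hypothetical criterion for the HbA1c prompt above:
# RubricCriterion("States the most recent HbA1c value above 9.0% with its date", 3.0)
```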
3.2 Task Categories and Distribution
Table 1 gives the planned distribution of tasks across categories, sources, and complexity tiers.
| Category | Tasks | Source | Tier |
|---|---|---|---|
| Medication reconciliation | 110 | MIMIC-IV + Synthea | II |
| Abnormal-lab triage | 100 | MIMIC-IV | I |
| Problem-list reasoning | 90 | Synthea | II |
| Longitudinal trend detection | 110 | MIMIC-IV + Synthea | III |
| Adverse-event identification | 90 | MIMIC-IV | III |
Complexity tiers indicate the resource-span of the gold answer: Tier I tasks require evidence from a single resource type; Tier II from two; Tier III from three or more with temporal reasoning. The tier mix is deliberate — over-representation of single-resource tasks is what makes existing benchmarks easy.
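Bundle construction is where the tier definitions become concrete: a task's bundle must contain the resources its gold answer spans, plus distractors, as described in the task schema above. The sketch below is a minimal illustration of that trimming step, assuming a per-patient FHIR R4 JSON export (as Synthea produces) and a fixed distractor cap; neither the function name nor the cap is a specified part of the curation pipeline.

```python
import json
import random

def trim_bundle(path, needed_types, distractor_cap=10, seed=0):
    """Load a per-patient FHIR R4 Bundle and keep only the entries whose
    resourceType the task requires, plus a capped random sample of
    distractor resources drawn from the remaining entries."""
    with open(path) as f:
        bundle = json.load(f)

    needed, other = [], []
    for entry in bundle.get("entry", []):
        rtype = entry.get("resource", {}).get("resourceType")
        (needed if rtype in needed_types else other).append(entry)

    random.seed(seed)
    distractors = random.sample(other, min(distractor_cap, len(other)))

    return {
        "resourceType": "Bundle",
        "type": bundle.get("type", "collection"),
        "entry": needed + distractors,
    }

# A Tier I abnormal-lab task, for example, might need only:
# task_bundle = trim_bundle("synthea_patient.json", {"Patient", "Observation"})
```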
How Caliper inherits HealthBench's rubric design at scale
HealthBench[1] ships with 5,000 conversations and 48,562 unique rubric criteria written by 262 physician annotators spanning 26 specialties and 49 languages; GPT-4.1 is its default grader. Caliper preserves the rubric-scoring shape but reduces the conversation surface and adds the FHIR bundle as the new variable.
3.3 Panel-of-Judges Scoring
Each model response is scored independently by three judges drawn from disjoint model families (e.g., Claude Opus 4.7, GPT-5, Gemini 2.0 Pro). Each judge applies the task's rubric, returning per-criterion ratings and a total score. The final score is the median of the three judge totals; disagreement above a fixed threshold flags the task for human review. The panel-of-judges design follows Verga et al.[9] directly.
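A minimal sketch of the aggregation step, assuming each judge returns a rubric total already normalised to the 0–1 scale reported in § 4; the 0.2 disagreement threshold is a placeholder, not a decided parameter.

```python
from statistics import median

def aggregate_panel(judge_scores: dict[str, float], threshold: float = 0.2) -> dict:
    """Combine three independent judge totals into a final task score.

    judge_scores maps judge-model name -> rubric total on a 0-1 scale.
    The final score is the median; if the spread between the highest and
    lowest judge exceeds the threshold, the task is flagged for human review.
    """
    scores = list(judge_scores.values())
    assert len(scores) == 3, "Caliper uses a panel of exactly three judges"
    spread = max(scores) - min(scores)
    return {
        "score": median(scores),
        "needs_human_review": spread > threshold,
        "judge_scores": judge_scores,  # retained for the audit trace
    }

# aggregate_panel({"judge_a": 0.82, "judge_b": 0.78, "judge_c": 0.40})
# -> {'score': 0.78, 'needs_human_review': True, ...}
```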
3.4 Calibration Set
A 50-task calibration subset is human-graded by two physician reviewers. The calibration set is used to (i) validate the rubric quality, (ii) measure judge-human agreement (target: Cohen's κ ≥ 0.70), and (iii) audit judge bias periodically as new models join the leaderboard.
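One way the κ target could be audited, sketched below under the assumption that each rubric criterion yields a binary met / not-met decision that can be paired between the judge panel and a physician reviewer across the calibration subset.

```python
def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters making binary (0/1) decisions on the
    same items, e.g. per-criterion met/not-met judgments on the 50-task
    calibration subset."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:  # degenerate case: chance agreement is already perfect
        return 1.0
    return (observed - expected) / (1 - expected)

# Target: cohens_kappa(panel_decisions, physician_decisions) >= 0.70
```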
§ 4 Evaluation Protocol
For each scored model, we report:
- Overall score — mean per-task rubric score across all 500 tasks, on a 0–1 scale.
- Per-category breakdowns for the five task categories listed in Table 1.
- Per-tier breakdowns distinguishing single-resource, two-resource, and three-plus-resource synthesis.
- Fact-invention rate — the proportion of responses asserting a specific clinical claim not retrievable from the FHIR bundle (analogous to Atrium's grounding-fidelity metric).
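The sketch below shows how these reported numbers could be rolled up from per-task results; the record fields (category, tier, score, invented_fact) are assumed names for the harness's internal format, mirroring the metrics above.

```python
from collections import defaultdict
from statistics import mean

def summarise(results: list[dict]) -> dict:
    """Roll per-task results up into the reported metrics.

    Each result record is assumed to carry: 'category', 'tier', 'score'
    (the 0-1 rubric score from the judge panel), and 'invented_fact'
    (True if the response asserted a clinical detail absent from the bundle).
    """
    by_category, by_tier = defaultdict(list), defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["score"])
        by_tier[r["tier"]].append(r["score"])

    return {
        "overall": mean(r["score"] for r in results),
        "per_category": {c: mean(s) for c, s in by_category.items()},
        "per_tier": {t: mean(s) for t, s in by_tier.items()},
        "fact_invention_rate": mean(float(r["invented_fact"]) for r in results),
    }
```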
4.1 Comparison to Existing Benchmarks
Table 2 positions Caliper in the evaluation landscape:
| Benchmark | Format | Grounding | Scoring |
|---|---|---|---|
| MedQA[3] | MCQ | None | Exact match |
| MedMCQA[4] | MCQ | None | Exact match |
| PubMedQA[5] | Yes/No/Maybe | Literature abstract | Exact match |
| MultiMedQA[2] | MCQ + open | None | Human panel |
| HealthBench[1] | Open dialogue | Free-text scenario | Single judge |
| EHRSHOT[10] | Few-shot prediction | Longitudinal EHR | Fixed metrics |
| Caliper | Open response | FHIR R4 bundle | Panel of judges |
§ 5 Expected Contributions
- Dataset. Approximately five hundred publicly released FHIR-grounded clinical reasoning tasks with physician-style rubrics — the first benchmark of its kind.
- Methodology. A reference panel-of-judges scoring harness with measured judge-human agreement.
- Empirical findings. The first quantitative measure of frontier-model spread on FHIR-grounded reasoning across categories that matter clinically.
§ 6 Limitations and Risks
Caliper inherits the limitations of its data sources. MIMIC-IV[6] represents ICU cohorts and may underrepresent ambulatory complexity; Synthea[7] is realistic but lacks real EHR noise. The panel-of-judges design assumes the panel's biases do not align — a failure mode noted by Zheng et al.[8] — which is precisely why we calibrate against human review on a 50-task subset rather than trusting the panel blindly.
A separate risk is benchmark gaming. If Caliper becomes influential, models will be tuned against it, and the benchmark loses signal. The mitigation is a held-out v0.2 expansion: a 100-task private set, periodically rotated, against which leaderboard entries are spot-checked.
§ 7 Conclusion
Caliper takes HealthBench's[1] rubric-graded methodology and plugs it into the data substrate clinical deployments actually use. By scoring frontier models against FHIR R4[11] resource graphs with a bias-corrected panel of judges[9], Caliper produces the first benchmark whose results have clear deployment-readiness implications. The expected outcome is a public leaderboard that frontier-model teams can publish against and clinical AI startups can use to ground their procurement decisions.
References
[1] Arora R, et al. (OpenAI). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv preprint, 2025. arxiv.org/abs/2505.08775
[2] Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature, 620:172–180, 2023. nature.com/articles/s41586-023-06291-2
[3] Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams (MedQA). arXiv preprint, 2020. arxiv.org/abs/2009.13081
[4] Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. Proceedings of CHIL, 2022. proceedings.mlr.press/v174/pal22a.html
[5] Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: A Dataset for Biomedical Research Question Answering. EMNLP-IJCNLP, 2019. aclanthology.org/D19-1259
[6] Johnson AEW, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10:1, 2023. nature.com/articles/s41597-022-01899-x
[7] Walonoski J, Kramer M, Nichols J, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients. Journal of the American Medical Informatics Association, 25(3):230–238, 2018. academic.oup.com/jamia/article/25/3/230/4098271
[8] Zheng L, Chiang WL, Sheng Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2306.05685
[9] Verga P, et al. (Cohere). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv preprint, 2024. arxiv.org/abs/2404.18796
[10] Wornow M, Thapa R, Steinberg E, Fries J, Shah N. EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2307.02028
[11] HL7 International. HL7 FHIR Release 4 (R4) Specification, v4.0.1. Official HL7 standard, 2019. hl7.org/fhir/R4