Paper 02 / 10 Preliminary Manuscript · v0.1 May 2026

Caliper: A FHIR-Grounded Extension of HealthBench for Patient-Record Reasoning

Five hundred clinical tasks scored against an actual patient record, not a paragraph of prose.

Abstract. HealthBench[1] established a standard for evaluating medical chat through physician-rubric-graded conversations, but its tasks are not grounded in a patient's actual structured record. We propose Caliper, a public benchmark of approximately five hundred tasks, each pairing a FHIR R4 bundle with a clinical question and a physician-style rubric. Tasks span medication reconciliation, abnormal-lab triage, problem-list reasoning, longitudinal trend detection, and adverse-event identification. Scoring uses a panel-of-judges methodology[9] grounded in the LLM-as-judge literature[8]; bundles are sourced from de-identified MIMIC-IV[6] and from Synthea[7]. Caliper differs from closed-form medical question benchmarks[3][4][5] by requiring grounded reasoning over a real FHIR resource graph, and from EHR few-shot benchmarks[10] by using open-ended rubric-scored generation rather than fixed prediction heads. The release target is a public leaderboard, an open dataset, and a peer-reviewed manuscript.

§ 1 Introduction

In May 2025, OpenAI released HealthBench[1], a 5,000-conversation evaluation in which physician-authored rubrics grade open-ended model responses. HealthBench moved medical model evaluation past multiple-choice and into the rubric-graded regime — a meaningful advance over MedQA[3], MedMCQA[4], and PubMedQA[5]. What HealthBench does not evaluate, however, is reasoning grounded in a patient's actual structured record. Its tasks consist of free-text scenarios; they do not require the model to navigate a FHIR resource graph the way a downstream clinical deployment would.

This is the gap Caliper fills. Caliper preserves HealthBench's rubric-graded methodology and extends it with two structural changes: every task is anchored to a FHIR R4[11] bundle, and scoring is performed by a panel of three diverse judge models[9] rather than a single arbiter. The result is a benchmark that tests the form of clinical reasoning a deployed system actually performs.

1.1 Contributions

  1. A public dataset of approximately five hundred FHIR-grounded clinical tasks, each with a deterministic gold answer and an open-ended physician-style rubric.
  2. A reference scoring harness that implements panel-of-judges evaluation[9] with audit traces, mitigating known LLM-as-judge biases[8].
  3. A public leaderboard with cross-model results (target: at least five frontier models) and the first quantitative measure of frontier-model spread on FHIR-grounded reasoning.

§ 2 Background and Related Work

2.1 The Medical LLM Benchmark Lineage

Medical LLM evaluation has passed through three generations. The first — MedQA[3], MedMCQA[4], PubMedQA[5] — consists of multiple-choice questions adapted from licensing exams or curated from biomedical literature. The second, exemplified by Med-PaLM's MultiMedQA[2], combined multiple-choice with open-ended human-evaluation panels along axes such as factuality, possible harm, and bias. The third — HealthBench[1] — formalised rubric-graded scoring at scale.

All three generations share an evaluation surface that is essentially a text vignette. None of them grade a model's ability to navigate a structured patient record. EHRSHOT[10] partially addresses this gap by evaluating foundation models on few-shot prediction tasks against longitudinal EHRs, but its task formulation reduces to fixed prediction heads rather than open-ended reasoning. Caliper occupies the previously empty intersection: open-ended, rubric-graded, FHIR-grounded.

2.2 LLM-as-Judge and Its Pitfalls

Zheng et al.[8] demonstrate that strong LLMs reach high agreement with human evaluators on open-ended responses but exhibit characteristic biases — position bias, verbosity bias, and self-enhancement bias when scoring outputs from a sibling model. These biases compromise single-judge evaluations of the kind HealthBench's default protocol uses. Verga et al.[9] show that replacing a single GPT-4 judge with a Panel of LLM Evaluators (PoLL) — three diverse smaller models — yields higher correlation with human ratings at lower cost. Caliper adapts the PoLL design, retaining the three-judge panel and median aggregation while drawing its judges from disjoint frontier-model families.

2.3 Patient-Record Sources

Caliper's bundles come from two sources. The de-identified MIMIC-IV[6] ICU dataset, mapped to FHIR R4[11], supplies real clinical complexity for tasks where it matters — longitudinal trends, adverse-event identification. Synthea[7] supplies a clean synthetic backbone for tasks where PHI exposure would otherwise be problematic, and supports scaling Caliper to roughly five hundred tasks without DUA bottlenecks.

§ 3 Proposed Approach

3.1 Task Schema

Each Caliper task is a four-tuple (bundle, prompt, rubric, gold): a FHIR R4 bundle, a clinical question posed over that bundle, a physician-style rubric against which the panel scores the response, and a deterministic gold answer.
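A minimal sketch of the tuple as a Python structure. The class and field names here are illustrative assumptions — the manuscript fixes only the four tuple elements and the category/tier labels of Table 1:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    # One physician-style criterion with a point weight.
    description: str
    points: float

@dataclass
class CaliperTask:
    bundle: dict                   # FHIR R4 Bundle, parsed from JSON
    prompt: str                    # clinical question posed over the bundle
    rubric: list[RubricCriterion]  # criteria the judge panel scores against
    gold: str                      # deterministic reference answer
    category: str                  # one of the five Table 1 categories
    tier: int                      # complexity tier I-III (here 1-3)

# Illustrative instance; the prompt and gold text are invented examples.
task = CaliperTask(
    bundle={"resourceType": "Bundle", "type": "collection", "entry": []},
    prompt="Which active medications conflict with the latest creatinine?",
    rubric=[RubricCriterion("Cites the abnormal creatinine value", 2.0)],
    gold="Hold metformin pending renal dosing review.",
    category="Medication reconciliation",
    tier=2,
)
```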

3.2 Task Categories and Distribution

Table 1. Caliper v0.1 task distribution.
Category                     | Bundles | Source             | Tier
Medication reconciliation    | 110     | MIMIC-IV + Synthea | II
Abnormal-lab triage          | 100     | MIMIC-IV           | I
Problem-list reasoning       | 90      | Synthea            | II
Longitudinal trend detection | 110     | MIMIC-IV + Synthea | III
Adverse-event identification | 90      | MIMIC-IV           | III

Complexity tiers indicate the resource-span of the gold answer: Tier I tasks require evidence from a single resource type; Tier II from two; Tier III from three or more with temporal reasoning. The tier mix is deliberate — over-representation of single-resource tasks is what makes existing benchmarks easy.

Figure 1 · Caliper scoring flow
Figure 1. (1) Each task is a four-tuple; (2) a model under evaluation produces an open-ended response over the supplied FHIR bundle; (3) three judges from disjoint model families score the response independently per the rubric, following the Panel-of-LLM-Evaluators design of Verga et al.[9] who report κ = 0.763 on KILT NQ (vs 0.627 for a single GPT-4 judge) at roughly one-seventh the cost; (4) the final score is the median of the three judge totals. Disagreement above a threshold flags the task for human review on the calibration set.
How Caliper inherits HealthBench's rubric design at scale

HealthBench[1] ships with 5,000 conversations and 48,562 unique rubric criteria scored across 262 physician annotators in 26 specialties and 49 languages; GPT-4.1 is its default grader. Caliper preserves the rubric-scoring shape but reduces the conversation surface and adds the FHIR bundle as the new variable.

3.3 Panel-of-Judges Scoring

Each model response is scored independently by three judges drawn from disjoint model families (e.g., Claude Opus 4.7, GPT-5, Gemini 2.0 Pro). Each judge applies the task's rubric, returning per-criterion ratings and a total score. The final score is the median of the three judge totals; disagreement above a fixed threshold flags the task for human review. The panel-of-judges design follows Verga et al.[9] directly.
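The median-plus-flag aggregation above can be sketched as follows. The threshold value is a placeholder assumption — the manuscript specifies only that a fixed threshold exists:

```python
from statistics import median

DISAGREEMENT_THRESHOLD = 0.2  # placeholder; assumes 0-1 normalized totals

def panel_score(judge_totals: list[float]) -> tuple[float, bool]:
    """Return the median of the three judge totals, plus a flag
    indicating the spread exceeds the human-review threshold."""
    assert len(judge_totals) == 3, "one total per panel judge"
    spread = max(judge_totals) - min(judge_totals)
    return median(judge_totals), spread > DISAGREEMENT_THRESHOLD

# Close agreement passes; a wide spread routes the task to human review.
score, needs_review = panel_score([0.72, 0.75, 0.70])  # -> (0.72, False)
score, needs_review = panel_score([0.30, 0.80, 0.75])  # -> (0.75, True)
```

The median (rather than the mean) keeps a single outlier judge from moving the final score, which is consistent with the human-review escape hatch for large disagreements.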

3.4 Calibration Set

A 50-task calibration subset is human-graded by two physician reviewers. The calibration set is used to (i) validate the rubric quality, (ii) measure judge-human agreement (target: Cohen's κ ≥ 0.70), and (iii) audit judge bias periodically as new models join the leaderboard.
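Judge-human agreement is measured with Cohen's κ. A self-contained computation over per-criterion categorical labels might look like this; the pass/fail labeling scheme is an assumption for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    two independent raters would reach by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    # Chance agreement under independence of the two raters.
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)

# Hypothetical per-criterion pass/fail labels: judge panel vs. physician.
judge     = [1, 1, 1, 0, 0, 0]
physician = [1, 1, 0, 0, 0, 1]
print(round(cohens_kappa(judge, physician), 3))  # -> 0.333
```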

§ 4 Evaluation Protocol

For each scored model, we report:

  1. Overall score — mean per-task rubric score across all 500 tasks, on a 0–1 scale.
  2. Per-category breakdowns for the five task categories listed in Table 1.
  3. Per-tier breakdowns distinguishing single-resource, two-resource, and three-plus-resource synthesis.
  4. Fact-invention rate — proportion of responses asserting a specific clinical fact not retrievable from the FHIR bundle (analogous to Atrium's grounding-fidelity metric).
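The four reported aggregates can be sketched with a small summarizer. The result keys and per-task fields below are hypothetical names chosen only to mirror metrics 1-4:

```python
from collections import defaultdict
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Aggregate per-task results into the four reported metrics.
    Each result dict carries hypothetical keys: 'category', 'tier',
    'score' (0-1 rubric score), and 'invented_fact' (bool)."""
    by_cat, by_tier = defaultdict(list), defaultdict(list)
    for r in results:
        by_cat[r["category"]].append(r["score"])
        by_tier[r["tier"]].append(r["score"])
    return {
        "overall": mean(r["score"] for r in results),
        "per_category": {c: mean(s) for c, s in by_cat.items()},
        "per_tier": {t: mean(s) for t, s in by_tier.items()},
        # True counts as 1, so the mean is the fact-invention rate.
        "fact_invention_rate": mean(r["invented_fact"] for r in results),
    }
```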
Success criterion. Caliper passes its v0.1 release bar when (a) at least five frontier models are scored, (b) the inter-model spread on overall score exceeds 15 percentage points (0.15 on the 0–1 scale), (c) judge-human agreement on the calibration set reaches κ ≥ 0.70, and (d) at least one external paper cites Caliper within six months of release.

4.1 Comparison to Existing Benchmarks

Table 2 positions Caliper in the evaluation landscape:

Table 2. Caliper vs. existing medical LLM benchmarks.
Benchmark      | Format              | Grounded in record? | Scoring
MedQA[3]       | MCQ                 | No                  | Exact match
MedMCQA[4]     | MCQ                 | No                  | Exact match
PubMedQA[5]    | Yes/No/Maybe        | Abstract            | Exact match
MultiMedQA[2]  | MCQ + open          | No                  | Human panel
HealthBench[1] | Open dialogue       | Free text           | Single judge
EHRSHOT[10]    | Few-shot prediction | EHR                 | Fixed metrics
Caliper        | Open response       | FHIR R4             | Panel of judges

§ 5 Expected Contributions

  1. Dataset. Approximately five hundred publicly released FHIR-grounded clinical reasoning tasks with physician-style rubrics — the first benchmark of its kind.
  2. Methodology. A reference panel-of-judges scoring harness with measured judge-human agreement.
  3. Empirical findings. The first quantitative measure of frontier-model spread on FHIR-grounded reasoning across categories that matter clinically.

§ 6 Limitations and Risks

Caliper inherits the limitations of its data sources. MIMIC-IV[6] represents ICU cohorts and may underrepresent ambulatory complexity; Synthea[7] is realistic but lacks real EHR noise. The panel-of-judges design assumes the panel's biases do not align — a failure mode noted by Zheng et al.[8] — which is precisely why we calibrate against human review on a 50-task subset rather than trusting the panel blindly.

A separate risk is benchmark gaming. If Caliper becomes influential, models will be tuned against it, and the benchmark loses signal. The mitigation is a held-out v0.2 expansion: a 100-task private set, periodically rotated, against which leaderboard entries are spot-checked.

§ 7 Conclusion

Caliper takes HealthBench's[1] rubric-graded methodology and plugs it into the data substrate clinical deployments actually use. By scoring frontier models against FHIR R4[11] resource graphs with a bias-corrected panel-of-judges[9], Caliper produces the first benchmark whose results have clear deployment-readiness implications. The expected outcome is a public leaderboard that frontier-model teams can publish against and clinical AI startups can use to ground their procurement decisions.

References

  1. Arora R, et al. (OpenAI). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv preprint, 2025. arxiv.org/abs/2505.08775
  2. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature, 620:172–180, 2023. nature.com/articles/s41586-023-06291-2
  3. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams (MedQA). arXiv preprint, 2020. arxiv.org/abs/2009.13081
  4. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical Domain Question Answering. Proceedings of CHIL, 2022. proceedings.mlr.press/v174/pal22a.html
  5. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: A Dataset for Biomedical Research Question Answering. EMNLP-IJCNLP, 2019. aclanthology.org/D19-1259
  6. Johnson AEW, Bulgarelli L, Shen L, et al. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10:1, 2023. nature.com/articles/s41597-022-01899-x
  7. Walonoski J, Kramer M, Nichols J, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients. Journal of the American Medical Informatics Association, 25(3):230–238, 2018. academic.oup.com/jamia/article/25/3/230/4098271
  8. Zheng L, Chiang WL, Sheng Y, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2306.05685
  9. Verga P, et al. (Cohere). Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv preprint, 2024. arxiv.org/abs/2404.18796
  10. Wornow M, Thapa R, Steinberg E, Fries J, Shah N. EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2307.02028
  11. HL7 International. HL7 FHIR Release 4 (R4) Specification, v4.0.1. Official HL7 standard, 2019. hl7.org/fhir/R4