Longitude: A Multi-Hop Diagnostic Reasoning Benchmark over Decade-Long Synthetic Patient Records
A clinical needle-in-a-haystack for the million-token era — 150 cases, each requiring synthesis across three or more temporally distant points in a 150–500k-token record.
Abstract
Existing medical benchmarks evaluate reasoning over short vignettes. Real clinical reasoning, by contrast, requires connecting signals across years of records: a lab trend from 2019, a medication started in 2022, a family-history note buried in an intake form from 2014. We propose Longitude, a benchmark of approximately 150 cases, each consisting of a synthetic-but-clinically-realistic ten-year longitudinal record (150,000 to 500,000 tokens) and a diagnostic or treatment question whose correct answer requires synthesizing evidence from at least three temporally distant points. Distractor needles are scattered throughout. The benchmark draws methodologically from the long-context evaluation literature[1][2][3] and from clinician-generated EHR datasets[10], but is the first to combine multi-hop synthesis with realistic decade-spanning patient records. Initial scoring targets Claude (1M context), Gemini 2.0[7] (2M), GPT-5 with retrieval, and a Self-Route hybrid[8].
§ 1 Introduction
Frontier language models have crossed thresholds of context length unimaginable two years ago: Claude offers a 1M-token window in production, Gemini 1.5/2.0[7] offers 2M. The marketing case is straightforward — load the whole document, ask the question. The empirical case is more complicated. Liu et al.'s "Lost in the Middle"[1] showed that LLMs underweight information placed in the middle of long contexts. RULER[2] demonstrated that claimed context length diverges sharply from effective context length on multi-hop tasks. BABILong[3] extended this finding to haystacks up to 10 million tokens. The conclusion across all three: long context is real but qualitatively different from short context, and a benchmark that tests only retrieval will overestimate it.
Medicine is the natural application domain for long-context reasoning. A clinician reviewing a complex case opens chart after chart, often pulling up records from years prior to make sense of the present. Yet no public benchmark exists for clinical long-context reasoning. MedAlign[10] includes longitudinal EHR data, but its tasks top out at 32K tokens. EHRSHOT[11] is longitudinal but reduces to few-shot prediction. Longitude is the missing benchmark.
1.1 Contributions
- A public dataset of ~150 synthetic decade-long FHIR-formatted patient records, each 150,000 to 500,000 tokens, paired with multi-hop diagnostic/treatment questions and deterministic gold answers.
- A scoring harness extending the needle-in-a-haystack protocol[9] from single-needle retrieval to multi-needle clinical synthesis, with explicit distractor placement.
- A cross-model leaderboard comparing long-context-only architectures, retrieval-augmented baselines, and Self-Route hybrids[8].
§ 2 Background and Related Work
2.1 The Long-Context Evaluation Lineage
The literature on long-context evaluation has converged on a shared methodology: synthetic haystacks with controllable needles. Kamradt's NIAH harness[9] introduced the canonical implementation. LongBench[4] generalized the protocol across multiple tasks and languages. ∞Bench[5] pushed the average input length past 100K tokens. NoLiMa[6] demonstrated that removing lexical overlap between query and needle collapses retrieval performance, revealing that much "long-context" performance is keyword-driven rather than reasoning-driven. Longitude inherits this design principle directly: questions and gold evidence in our records share minimal lexical overlap, forcing latent clinical inference rather than keyword matching.
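The core mechanic shared by these harnesses is simple: insert a fact at a controlled fractional depth of a long context, then test whether the model can recall it. A minimal sketch of that placement step (function name and boundary-snapping heuristic are ours, not Kamradt's implementation):

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth of the haystack
    (0.0 = start, 1.0 = end), snapping forward to the nearest
    sentence boundary so the surrounding text stays fluent."""
    assert 0.0 <= depth <= 1.0
    pos = int(len(haystack) * depth)
    # Snap to the next sentence boundary at or after `pos`, if any.
    boundary = haystack.find(". ", pos)
    if boundary == -1:
        return haystack + " " + needle
    boundary += 2  # step past ". "
    return haystack[:boundary] + needle + " " + haystack[boundary:]
```

Sweeping `depth` over a grid is what produces the familiar depth-vs-length recall heatmaps; Longitude generalizes the single placement to three or more gold needles plus distractors per record.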
2.2 Long Context vs. Retrieval
Li et al.[8] conduct the most systematic head-to-head comparison of long-context LLMs against retrieval-augmented architectures, finding that neither dominates: long context wins on synthesis-heavy tasks, retrieval wins on focused-fact tasks, and a hybrid "Self-Route" architecture — where the model decides which mode to invoke — outperforms both. Longitude is explicitly designed to surface this tradeoff in the clinical domain, where both architectures are actively deployed and the empirical case for one or the other is not yet settled.
2.3 Clinical Longitudinal Data
MedAlign[10] is the closest existing clinical long-context dataset: 276 longitudinal EHRs with clinician-written instructions. Their quantitative finding — performance climbs from 51.8% to 60.1% as context expands from 2K to 32K tokens — establishes that more context helps for clinical reasoning, but it stops at 32K. Wornow et al.'s EHRSHOT[11] covers 41.6 million events across 6,739 patients but formulates evaluation as few-shot prediction with fixed task heads. Longitude bridges the two: clinician-style multi-hop questions over substantially larger contexts than MedAlign, with open-ended evaluation rather than prediction heads.
§ 3 Proposed Approach
3.1 Record Generation
Each Longitude record is produced by Synthea[12] with a custom event-aging engine. The base Synthea cohort provides a clinically valid backbone of conditions, encounters, observations, and prescriptions over the patient's lifetime. The aging engine then layers in:
- Realistic note volume. Each encounter generates a free-text note (history of present illness, assessment, plan) consistent with the structured event.
- Distractor needles. Each record contains 3–5 "false positive" findings or red-herring data points designed to mislead a keyword-search-only model.
- Gold-evidence anchoring. Each question's gold answer requires evidence from ≥3 distinct, temporally distant resources, with the temporal span ranging from 1 to 9 years.
Token counts are tuned so that the median record is approximately 300K tokens: comfortably within the 1M–2M windows of the long-context models under test, yet far larger than any retrieval-only baseline can pass verbatim, so retrieval baselines must actually retrieve.
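The needle-layering step of the aging engine can be sketched as follows. This is illustrative only, assuming a flat list of dated notes; the real engine edits Synthea FHIR bundles, and all names here are ours:

```python
from dataclasses import dataclass

@dataclass
class Note:
    date: str                    # ISO date of the encounter, e.g. "2019-04-12"
    text: str                    # free-text note consistent with the structured event
    is_gold: bool = False        # part of a question's gold evidence
    is_distractor: bool = False  # planted red herring

def layer_needles(base_notes, gold, distractors, min_span_years=1):
    """Merge gold-evidence and distractor notes into a chronological record,
    enforcing the benchmark's constraints on the gold evidence."""
    if len(gold) < 3:
        raise ValueError("each question needs evidence from >= 3 resources")
    years = [int(n.date[:4]) for n in gold]
    if max(years) - min(years) < min_span_years:
        raise ValueError("gold evidence must span temporally distant points")
    return sorted(base_notes + gold + distractors, key=lambda n: n.date)
```

The two `raise` branches encode the ≥3-resource and temporal-span requirements from the list above, so an invalid case fails at generation time rather than at scoring time.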
3.2 Question Categories
| Category | Cases | Required evidence span |
|---|---|---|
| Diagnostic synthesis | 40 | 3–5 resources, 2–7 years |
| Medication-history reconciliation | 30 | 3+ resources, 1–5 years |
| Latent risk identification | 30 | 3+ resources, 3–9 years |
| Adverse-event causal attribution | 30 | 3–4 resources, 1–4 years |
| Treatment-response trajectory | 20 | ≥4 resources, 2–6 years |
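Each row of the table above corresponds to cases that could be represented by a small spec. The field names below are illustrative, not the released schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseSpec:
    case_id: str
    category: str          # e.g. "diagnostic_synthesis" (hypothetical label)
    question: str
    gold_answer: str
    gold_evidence: tuple   # (resource_id, iso_date) pairs, at least three
    distractor_ids: tuple  # resource ids of planted red herrings

    def evidence_span_years(self) -> int:
        """Years between the earliest and latest gold-evidence resources."""
        years = [int(date[:4]) for _, date in self.gold_evidence]
        return max(years) - min(years)
```

Keeping the gold evidence as explicit (resource, date) pairs is what makes the per-category span constraints in the table checkable mechanically.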
3.3 Scoring
Each model's response is scored on three axes:
- Final answer accuracy against a deterministic gold (binary).
- Evidence-trace fidelity: when the model is asked to cite specific record locations supporting its answer, what fraction of cited locations actually contain supporting evidence? This metric is adapted from the Attributable to Identified Sources (AIS) framework of Rashkin et al. (see Project 05).
- Distractor robustness: did the model incorporate any distractor needle into its answer? Binary.
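The three axes combine into a per-case score like the sketch below. Using cited locations as the signal for distractor incorporation is a simplification of the protocol above, and all names are ours:

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    answer_correct: bool       # binary match against the deterministic gold
    cited: set                 # record locations the model cited as support
    gold_locations: set        # locations that actually contain gold evidence
    distractor_locations: set  # locations of planted distractor needles

def score_case(r: CaseResult) -> dict:
    """Score one case on the three Longitude axes (sketch)."""
    fidelity = len(r.cited & r.gold_locations) / len(r.cited) if r.cited else 0.0
    return {
        "accuracy": 1.0 if r.answer_correct else 0.0,
        "evidence_fidelity": fidelity,
        "distractor_robust": 0.0 if r.cited & r.distractor_locations else 1.0,
    }
```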
§ 4 Evaluation Protocol
Initial baselines include:
- Long-context-only. Claude Opus 4.7 (1M), Gemini 2.0 Pro (2M)[7], GPT-5 long-context.
- Retrieval-only. GPT-5 with embeddings-based retrieval; Claude with retrieval-augmented prompting.
- Hybrid. Self-Route[8] implementations layered over each frontier model.
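In Self-Route[8], the model first attempts the question over retrieved chunks and may declare them insufficient; only then is the full long context paid for. A minimal sketch with placeholder callables (`retrieve`, `llm_short`, `llm_long` and the prompt wording are ours, not the paper's implementation):

```python
def self_route(query, record_chunks, retrieve, llm_short, llm_long):
    """Try RAG first; fall back to the full long context only when
    the model itself judges the retrieved excerpts insufficient."""
    top_chunks = retrieve(query, record_chunks, k=8)
    prompt = (
        "Answer from the excerpts below, or reply exactly 'UNANSWERABLE' "
        "if they are insufficient.\n\n"
        + "\n---\n".join(top_chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = llm_short(prompt)
    if answer.strip() == "UNANSWERABLE":
        full_context = "\n".join(record_chunks)
        return llm_long(f"{full_context}\n\nQuestion: {query}"), "long-context"
    return answer, "retrieval"
```

Logging the returned route label per category is what lets the leaderboard report how often the hybrid escalates to the expensive long-context path.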
4.1 Expected Findings
Drawing on the general-domain literature[1][2][8], we predict three findings: (i) long-context models will outperform retrieval-only on diagnostic synthesis and latent risk identification, where the answer requires synthesis across resources without query-time keyword anchoring; (ii) retrieval will be competitive on medication reconciliation, which is more focused-fact; (iii) Self-Route hybrids will lead on aggregate score but will not dominate on every category. If any of these predictions fails, the failure is itself informative — and is the kind of result frontier-lab evaluation teams cite.
§ 5 Expected Contributions
- Dataset. ~150 publicly released decade-long synthetic patient records with multi-hop questions and distractor needles — the first clinical long-context benchmark of its scale.
- Methodology. A scoring harness that combines final-answer accuracy with evidence-trace fidelity and distractor-robustness measurement.
- Empirical findings. The first quantitative head-to-head of long-context vs retrieval vs hybrid architectures in clinical multi-hop reasoning.
§ 6 Limitations and Risks
Synthea[12] is clinically validated but lacks the messiness of real EHR data — inconsistent coding, free-text in structured fields, conflicting historical resources, deprecated codes. Real records also contain artifacts from EHR migration that synthetic records cannot reproduce. Future versions of Longitude should extend to MIMIC-IV-derived records under appropriate DUA, accepting the shorter time horizon MIMIC's ICU focus implies.
A second limitation: the 1M-context regime is itself young. Models tested in Longitude v0.1 will quickly be superseded, and the benchmark must be re-run on subsequent frontier releases for the leaderboard to retain signal. We commit to a quarterly re-run.
§ 7 Conclusion
Longitude brings the discipline of needle-in-a-haystack evaluation[9] into the clinical domain at the scale that matters: decade-long records, multi-hop synthesis, and distractor-rich contexts. It is the benchmark that makes the 1M-token regime quantitatively visible in healthcare, and the only benchmark that lets a clinical AI team rationally choose between long-context and retrieval architectures based on evidence rather than vendor claims.
References
- [1] Liu NF, Lin K, Hewitt J, et al. Lost in the Middle: How Language Models Use Long Contexts. TACL, 2024. arxiv.org/abs/2307.03172
- [2] Hsieh CP, Sun S, Kriman S, et al. RULER: What's the Real Context Size of Your Long-Context Language Models? NVIDIA / COLM, 2024. arxiv.org/abs/2404.06654
- [3] Kuratov Y, Bulatov A, Anokhin P, et al. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. NeurIPS Datasets and Benchmarks, 2024. arxiv.org/abs/2406.10149
- [4] Bai Y, Lv X, Zhang J, et al. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. ACL, 2024. arxiv.org/abs/2308.14508
- [5] Zhang X, Chen Y, Hu S, et al. ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. ACL, 2024. arxiv.org/abs/2402.13718
- [6] Modarressi A, Deilamsalehy H, Dernoncourt F, et al. NoLiMa: Long-Context Evaluation Beyond Literal Matching. Adobe / ICML, 2025. arxiv.org/abs/2502.05167
- [7] Gemini Team, Google. Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. 2024. arxiv.org/abs/2403.05530
- [8] Li Z, Li C, Zhang M, et al. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. Google DeepMind / EMNLP Industry, 2024. arxiv.org/abs/2407.16833
- [9] Kamradt G. Needle In A Haystack — Pressure Testing LLMs. GitHub repository, 2023. github.com/gkamradt/LLMTest_NeedleInAHaystack
- [10] Fleming SL, Lozano A, Haberkorn WJ, et al. MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. Stanford / AAAI, 2024. arxiv.org/abs/2308.14089
- [11] Wornow M, Thapa R, Steinberg E, Fries J, Shah N. EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2307.02028
- [12] Walonoski J, Kramer M, Nichols J, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients. JAMIA, 25(3):230–238, 2018. academic.oup.com/jamia/article/25/3/230/4098271
C. Takeoff AI · Set in EB Garamond