Paper 04 / 10 · Preliminary Manuscript · v0.1 · May 2026
Dossier №01 · Project 04 · Longitude

Longitude: A Multi-Hop Diagnostic Reasoning Benchmark over Decade-Long Synthetic Patient Records

A clinical needle-in-a-haystack for the million-token era — 150 cases, each requiring synthesis across three or more temporally distant points in a 150–500k-token record.

Abstract. Existing medical benchmarks evaluate reasoning over short vignettes. Real clinical reasoning, by contrast, requires connecting signals across years of records: a lab trend from 2019, a medication started in 2022, a family-history note buried in an intake form from 2014. We propose Longitude, a benchmark of approximately 150 cases, each consisting of a synthetic-but-clinically-realistic ten-year longitudinal record (150,000 to 500,000 tokens) and a diagnostic or treatment question whose correct answer requires synthesizing evidence from at least three temporally distant points. Distractor needles are scattered throughout. The benchmark draws methodologically from the long-context evaluation literature[1][2][3] and from clinician-generated EHR datasets[10], but is the first to combine multi-hop synthesis with realistic decade-spanning patient records. Initial scoring targets Claude (1M context), Gemini 2.0[7] (2M), GPT-5 with retrieval, and a Self-Route hybrid[8].

§ 1 Introduction

Frontier language models have crossed thresholds of context length unimaginable two years ago: Claude offers a 1M-token window in production, Gemini 1.5/2.0[7] offers 2M. The marketing case is straightforward — load the whole document, ask the question. The empirical case is more complicated. Liu et al.'s "Lost in the Middle"[1] showed that LLMs underweight information placed in the middle of long contexts. RULER[2] demonstrated that claimed context length diverges sharply from effective context length on multi-hop tasks. BABILong[3] extended this finding to haystacks up to 10 million tokens. The conclusion across all three: long context is real but qualitatively different from short context, and a benchmark that tests only retrieval will overestimate it.

Medicine is the natural application domain for long-context reasoning. A clinician reviewing a complex case opens chart after chart, often pulling up records from years prior to make sense of the present. Yet no public benchmark exists for clinical long-context reasoning. MedAlign[10] includes longitudinal EHR data, but its tasks top out at 32K tokens. EHRSHOT[11] is longitudinal but reduces to few-shot prediction. Longitude is the missing benchmark.

1.1 Contributions

  1. A public dataset of ~150 synthetic decade-long FHIR-formatted patient records, each 150,000 to 500,000 tokens, paired with multi-hop diagnostic/treatment questions and deterministic gold answers.
  2. A scoring harness extending the needle-in-a-haystack protocol[9] from single-needle retrieval to multi-needle clinical synthesis, with explicit distractor placement.
  3. A cross-model leaderboard comparing long-context-only architectures, retrieval-augmented baselines, and Self-Route hybrids[8].

§ 2 Background and Related Work

2.1 The Long-Context Evaluation Lineage

The literature on long-context evaluation has converged on a shared methodology: synthetic haystacks with controllable needles. Kamradt's NIAH harness[9] introduced the canonical implementation. LongBench[4] generalised the protocol across multiple tasks and languages. ∞Bench[5] pushed the average input length past 100K tokens. NoLiMa[6] demonstrated that removing lexical overlap between query and needle collapses retrieval performance, revealing that much "long-context" performance is keyword-driven rather than reasoning-driven. Longitude inherits this design principle directly: questions and gold evidence in our records share minimal lexical overlap, forcing latent clinical inference rather than keyword matching.
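As a concrete illustration of that design principle, a construction-time filter could reject question/evidence pairs that share too much vocabulary. The function names, stopword list, and threshold below are illustrative assumptions, not the released harness.

```python
# Hypothetical sketch of a NoLiMa-style lexical-overlap filter used while authoring cases.
import re

STOPWORDS = {
    "the", "a", "an", "of", "in", "on", "for", "to", "and", "or", "is",
    "was", "with", "this", "that", "patient", "most", "likely", "what",
}

def content_words(text: str) -> set[str]:
    """Lowercase alphabetic tokens minus stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def lexical_overlap(question: str, evidence: str) -> float:
    """Jaccard overlap of content words between the question and a gold evidence span."""
    q, e = content_words(question), content_words(evidence)
    return len(q & e) / max(len(q | e), 1)

def passes_overlap_filter(question: str, evidence_spans: list[str],
                          max_overlap: float = 0.15) -> bool:
    """Reject a case if any gold evidence span shares too much vocabulary with the question."""
    return all(lexical_overlap(question, s) <= max_overlap for s in evidence_spans)
```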

2.2 Long Context vs. Retrieval

Li et al.[8] conduct the most systematic head-to-head comparison of long-context LLMs against retrieval-augmented architectures, finding that neither dominates: long context wins on synthesis-heavy tasks, retrieval wins on focused-fact tasks, and a hybrid "Self-Route" architecture — where the model decides which mode to invoke — outperforms both. Longitude is explicitly designed to surface this tradeoff in the clinical domain, where both architectures are actively deployed and the empirical case for one or the other is not yet settled.
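A minimal sketch of the Self-Route pattern as we would apply it to Longitude records follows; the `retrieve` and `call_model` callables and the prompt wording are placeholder assumptions, not a vendor API.

```python
# Self-Route sketch in the spirit of Li et al.[8]: answer from retrieved chunks first,
# and fall back to the full record only when the model declares the chunks insufficient.
from typing import Callable

def self_route(question: str, record: str,
               retrieve: Callable[[str, str, int], list[str]],
               call_model: Callable[[str], str],
               k: int = 10) -> tuple[str, str]:
    """Return (answer, mode) where mode is 'rag' or 'long-context'."""
    chunks = retrieve(question, record, k)
    excerpts = "\n---\n".join(chunks)
    rag_prompt = (
        "Answer the question from the excerpts below, or reply exactly "
        "'unanswerable' if they are insufficient.\n\n"
        f"Excerpts:\n{excerpts}\n\nQuestion: {question}"
    )
    answer = call_model(rag_prompt)
    if "unanswerable" not in answer.lower():
        return answer, "rag"
    # Fallback: give the long-context model the entire longitudinal record.
    full_prompt = f"Patient record:\n{record}\n\nQuestion: {question}"
    return call_model(full_prompt), "long-context"
```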

2.3 Clinical Longitudinal Data

MedAlign[10] is the closest existing clinical long-context dataset: 276 longitudinal EHRs with clinician-written instructions. Their quantitative finding — performance climbs from 51.8% to 60.1% as context expands from 2K to 32K tokens — establishes that more context helps for clinical reasoning, but it stops at 32K. Wornow et al.'s EHRSHOT[11] covers 41.6 million events across 6,739 patients but formulates evaluation as few-shot prediction with fixed task heads. Longitude bridges the two: clinician-style multi-hop questions over substantially larger contexts than MedAlign, with open-ended evaluation rather than prediction heads.

§ 3 Proposed Approach

3.1 Record Generation

Each Longitude record is produced by Synthea[12] with a custom event-aging engine. The base Synthea cohort provides a clinically valid backbone of conditions, encounters, observations, and prescriptions over the patient's lifetime. The aging engine then layers in:

  1. the gold evidence needles the case's question requires, inserted as dated resources at temporally distant points in the record;
  2. distractor needles, clinically plausible but non-supporting findings scattered throughout; and
  3. routine filler encounters and observations that bring each record to its target length.

Token counts are tuned so that the median record is approximately 300K tokens, large enough that retrieval-only baselines cannot simply place the whole record in a standard context window, yet within the 1M–2M windows of the long-context models under test.
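A simplified sketch of how a case might be assembled appears below; the data structures, the token estimator, and the function names are assumptions for illustration, not the released generator.

```python
# Illustrative case assembly: merge evidence and distractor needles into the Synthea
# backbone, pad with filler resources toward the target token count, keep chronology.
import json
import random
from dataclasses import dataclass

@dataclass
class Needle:
    date: str        # ISO date, e.g. "2015-06-02"
    resource: dict   # FHIR-style resource carrying the finding
    kind: str        # "evidence" or "distractor"

def approx_tokens(resource: dict) -> int:
    """Rough token estimate: ~4 characters per token of serialized JSON."""
    return len(json.dumps(resource)) // 4

def assemble_case(backbone: list[dict], needles: list[Needle],
                  filler_pool: list[dict], target_tokens: int = 300_000,
                  seed: int = 0) -> list[dict]:
    """Merge needles into the Synthea backbone, then pad with filler to the target size."""
    rng = random.Random(seed)
    record = backbone + [n.resource for n in needles]
    total = sum(approx_tokens(r) for r in record)
    filler = list(filler_pool)
    rng.shuffle(filler)
    while total < target_tokens and filler:
        nxt = filler.pop()
        record.append(nxt)
        total += approx_tokens(nxt)
    # Sort chronologically, as a real chart export would read.
    record.sort(key=lambda r: r.get("date", ""))
    return record
```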

Figure 1 · A Longitude case
[Figure 1 schematic: a 2014–2024 patient timeline. Evidence points: family hx of CRC (2015), FOBT+ (2019), microcytic anemia (2023). Distractors: URI (2017), sprain (2020), sinusitis (2021). Diagnostic question: most likely cause of this patient's anemia? Record size 150k–500k tokens, 3+ evidence points, 3 distractors.]
Figure 1. A Longitude case sketches a synthetic 10-year FHIR record. Three temporally distant evidence points (top) must be synthesised; distractor needles (below) test the model's ability to ignore non-supporting findings. MedAlign[10] showed accuracy climbs from 51.8% at 2K to 60.1% at 32K tokens; Longitude extends the dynamic range an order of magnitude further. NoLiMa[6] demonstrated that GPT-4o falls from 99.3% to 69.7% when lexical-overlap cues are removed; Longitude inherits that no-lexical-overlap design. RULER[2] reports GPT-4-1106's effective context length is 64K despite a 128K claim — the gap Longitude is built to measure.
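For concreteness, the Figure 1 case might be serialized roughly as follows. The field names, token count, and file path are illustrative assumptions rather than the final case schema, and the gold answer is left as a placeholder.

```python
# Hypothetical serialization of the Figure 1 case (illustrative, not the released schema).
example_case = {
    "case_id": "longitude-0001",
    "record_path": "records/longitude-0001.fhir.json",
    "record_tokens": 312_000,  # assumed value within the 150k-500k range
    "question": "What is the most likely cause of this patient's anemia?",
    "evidence": [  # the three temporally distant gold needles from Figure 1
        {"year": 2015, "finding": "family history of colorectal cancer"},
        {"year": 2019, "finding": "positive fecal occult blood test (FOBT+)"},
        {"year": 2023, "finding": "microcytic anemia"},
    ],
    "distractors": [  # non-supporting findings the model must ignore
        {"year": 2017, "finding": "upper respiratory infection"},
        {"year": 2020, "finding": "sprain"},
        {"year": 2021, "finding": "sinusitis"},
    ],
    "gold_answer": "<deterministic gold answer>",
}
```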

3.2 Question Categories

Table 1. Longitude v0.1 task distribution.
Category                               Cases   Required evidence span
Diagnostic synthesis                     40    3–5 resources, 2–7 years
Medication-history reconciliation        30    3+ resources, 1–5 years
Latent risk identification               30    3+ resources, 3–9 years
Adverse-event causal attribution         30    3–4 resources, 1–4 years
Treatment-response trajectory            20    ≥4 resources, 2–6 years

3.3 Scoring

Each model's response is scored on three axes (a minimal scoring sketch follows the list):

  1. Final answer accuracy against a deterministic gold (binary).
  2. Evidence-trace fidelity: when the model is asked to cite specific record locations supporting its answer, what fraction of cited locations actually contain supporting evidence? This metric is adapted from the attribution framework of Rashkin et al. (the AIS framework, see Project 05).
  3. Distractor robustness: did the model incorporate any distractor needle into its answer? Binary.
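Under assumed field names (a gold answer string, the set of supporting resource IDs, and the distractor findings) and an assumed response format in which the model returns an answer plus cited resource IDs, the three axes could be computed roughly as follows.

```python
# Minimal scoring sketch for the three axes; the data layout is an assumption.
from dataclasses import dataclass, field

@dataclass
class GoldCase:
    answer: str                  # deterministic gold answer
    evidence_ids: set[str]       # resource IDs that genuinely support the answer
    distractor_terms: set[str]   # findings that must not appear in the answer

@dataclass
class ModelResponse:
    answer: str
    cited_ids: list[str] = field(default_factory=list)

def score_case(gold: GoldCase, resp: ModelResponse) -> dict:
    # 1. Final-answer accuracy (binary, normalized exact match).
    accuracy = int(resp.answer.strip().lower() == gold.answer.strip().lower())
    # 2. Evidence-trace fidelity: fraction of cited locations that actually support the answer.
    fidelity = (sum(cid in gold.evidence_ids for cid in resp.cited_ids) / len(resp.cited_ids)
                if resp.cited_ids else 0.0)
    # 3. Distractor robustness: 1 if no distractor finding leaked into the answer.
    robustness = int(not any(t.lower() in resp.answer.lower() for t in gold.distractor_terms))
    return {"accuracy": accuracy, "evidence_fidelity": fidelity,
            "distractor_robustness": robustness}
```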

§ 4 Evaluation Protocol

Initial baselines include Claude with the full record in its 1M-token context window, Gemini 2.0[7] with its 2M-token window, GPT-5 paired with a retrieval pipeline over chunked records, and a Self-Route hybrid[8] that chooses between retrieval and full-context modes per question.
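For the retrieval-augmented baselines, the retrieval step could be as simple as embedding the record resource-by-resource and keeping the top-k chunks by cosine similarity; the `embed` callable, the chunking granularity, and k below are illustrative assumptions.

```python
# Sketch of the retrieval-only baseline's retrieval step.
from typing import Callable
import numpy as np

def top_k_chunks(question: str, chunks: list[str],
                 embed: Callable[[str], list[float]], k: int = 20) -> list[str]:
    """Return the k chunks most cosine-similar to the question."""
    q = np.asarray(embed(question), dtype=float)
    sims = []
    for chunk in chunks:
        v = np.asarray(embed(chunk), dtype=float)
        sims.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)))
    order = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in order]
```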

Pass criterion. Longitude v0.1 succeeds if it (a) cleanly demonstrates a measurable advantage for 1M context over retrieval-only on at least one task category, (b) reproduces the NoLiMa finding[6] that low-lexical-overlap questions collapse retrieval performance, and (c) provides the first clinical-domain quantification of the long-context-vs-retrieval frontier.

4.1 Expected Findings

Drawing on the general-domain literature[1][2][8], we predict three findings: (i) long-context models will outperform retrieval-only on diagnostic synthesis and latent risk identification, where the answer requires synthesis across resources without query-time keyword anchoring; (ii) retrieval will be competitive on medication reconciliation, which is more focused-fact; (iii) Self-Route hybrids will lead on aggregate score but will not dominate on every category. If any of these predictions fails, the failure is itself informative — and is the kind of result frontier-lab evaluation teams cite.

§ 5 Expected Contributions

  1. Dataset. ~150 publicly released decade-long synthetic patient records with multi-hop questions and distractor needles — the first clinical long-context benchmark of its scale.
  2. Methodology. A scoring harness that combines final-answer accuracy with evidence-trace fidelity and distractor-robustness measurement.
  3. Empirical findings. The first quantitative head-to-head of long-context vs retrieval vs hybrid architectures in clinical multi-hop reasoning.

§ 6 Limitations and Risks

Synthea[12] is clinically validated but lacks the messiness of real EHR data — inconsistent coding, free-text in structured fields, conflicting historical resources, deprecated codes. Real records also contain artifacts from EHR migration that synthetic records cannot reproduce. Future versions of Longitude should extend to MIMIC-IV-derived records under appropriate DUA, accepting the shorter time horizon MIMIC's ICU focus implies.

A second limitation: the 1M-context regime is itself young. Models tested in Longitude v0.1 will quickly be superseded, and the benchmark must be re-run on subsequent frontier releases for the leaderboard to retain signal. We commit to a quarterly re-run.

§ 7 Conclusion

Longitude brings the discipline of needle-in-a-haystack evaluation[9] into the clinical domain at the scale that matters: decade-long records, multi-hop synthesis, and distractor-rich haystacks. It is the benchmark that makes the 1M-token regime quantitatively visible in healthcare, and the only benchmark that lets a clinical AI team rationally choose between long-context and retrieval architectures based on evidence rather than vendor claims.

References

  1. Liu NF, Lin K, Hewitt J, et al. Lost in the Middle: How Language Models Use Long Contexts. TACL, 2024. arxiv.org/abs/2307.03172
  2. Hsieh CP, Sun S, Kriman S, et al. RULER: What's the Real Context Size of Your Long-Context Language Models? NVIDIA / COLM, 2024. arxiv.org/abs/2404.06654
  3. Kuratov Y, Bulatov A, Anokhin P, et al. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. NeurIPS Datasets and Benchmarks, 2024. arxiv.org/abs/2406.10149
  4. Bai Y, Lv X, Zhang J, et al. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. ACL, 2024. arxiv.org/abs/2308.14508
  5. Zhang X, Chen Y, Hu S, et al. ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. ACL, 2024. arxiv.org/abs/2402.13718
  6. Modarressi A, Deilamsalehy H, Dernoncourt F, et al. NoLiMa: Long-Context Evaluation Beyond Literal Matching. Adobe / ICML, 2025. arxiv.org/abs/2502.05167
  7. Gemini Team, Google. Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. 2024. arxiv.org/abs/2403.05530
  8. Li Z, Li C, Zhang M, et al. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. Google DeepMind / EMNLP Industry, 2024. arxiv.org/abs/2407.16833
  9. Kamradt G. Needle In A Haystack — Pressure Testing LLMs. GitHub repository, 2023. github.com/gkamradt/LLMTest_NeedleInAHaystack
  10. Fleming SL, Lozano A, Haberkorn WJ, et al. MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. Stanford / AAAI, 2024. arxiv.org/abs/2308.14089
  11. Wornow M, Thapa R, Steinberg E, Fries J, Shah N. EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. NeurIPS Datasets and Benchmarks, 2023. arxiv.org/abs/2307.02028
  12. Walonoski J, Kramer M, Nichols J, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients. JAMIA, 25(3):230–238, 2018. academic.oup.com/jamia/article/25/3/230/4098271