Paper 07 / 10 Preliminary Manuscript · v0.1 May 2026
Dossier №01 · Project 07 · Auris

Auris: End-to-End Ambient Voice-to-FHIR Pipeline with Structured Resource Validation

From clinician-patient audio to discrete, write-ready Observation, Condition, and MedicationRequest resources — validated against FHIR R4 profiles before they ever surface.

Abstract

Ambient AI scribing has become the headline product category for clinical AI in 2026, validated in recent NEJM AI work — Tierney et al.'s 2025 pragmatic RCT on ambient-AI-scribe well-being[13] and Lukac et al.'s 2025 head-to-head trial of DAX Copilot vs Nabla (N=238, 14 specialties), in which Nabla reduced time-in-note by 9.5% (95% CI −17.2 to −1.8, p=0.02)[15]. Production scribes, however, predominantly emit unstructured narrative, leaving the discrete FHIR resources the EHR actually consumes to a downstream step. Auris is an end-to-end pipeline from clinician-patient audio to structured, validated FHIR R4 resources. The pipeline composes Whisper-large-v3[1] for transcription, pyannote 3.x[2][3] for diarisation, Claude with grammar-constrained JSON-Schema decoding[5][6][7] for resource extraction, and the official HL7 FHIR Validator for profile conformance. The methodology builds on the SOAP-note generation literature[8][9], the medication-event extraction work from n2c2[11][12], and the recent Infherno[10] agent-based FHIR-synthesis prior art. The release target is a 100-dialogue synthetic dataset with gold FHIR annotations; the pass criterion is ≥ 80% F1 on FHIR resource extraction.

§ 1 Introduction

Two recent NEJM AI publications provide the most rigorous evidence to date that ambient scribes affect clinician workload. Tierney et al.'s pragmatic RCT[13] measured practitioner well-being outcomes from an ambient AI scribe in production. Lukac et al.'s 2025 head-to-head RCT[15] compared DAX Copilot and Nabla across 238 physicians in 14 specialties (Nov 2024 – Jan 2025): Nabla cut time-in-note by 9.5% (95% CI −17.2 to −1.8, p=0.02) while DAX showed a non-significant 1.7% reduction; both arms showed Mini-Z burnout improvements, but neither reduced pajama-time (after-hours EHR use). The output of both systems, however, is a narrative SOAP note — the same artifact human scribes have produced for decades. The EHR consumes that narrative downstream, parsing it into discrete fields with variable fidelity.

The valuable artifact — for billing, for population health, for downstream AI use — is structured FHIR. Auris targets that artifact directly, producing FHIR resources as the primary output and a derived narrative note as the secondary. The Infherno preprint[10] (2025) is the closest prior art and confirms the architecture is feasible; we extend by adding diarisation, explicit profile validation, and a released evaluation dataset.

1.1 Contributions

  1. An open-source, reproducible audio-to-FHIR pipeline composed of named, citation-anchored components.
  2. A released dataset of 100 synthetic clinician-patient dialogues with gold FHIR annotations covering Observation, Condition, MedicationRequest, and AllergyIntolerance.
  3. An evaluation harness measuring F1 on resource extraction with per-resource breakdowns, and a documented characterisation of Whisper's failure modes on clinical speech, drawing on Adedeji et al.[4].

§ 2 Background and Related Work

2.1 ASR for Clinical Speech

Whisper[1], trained on 680,000 hours of weakly supervised multilingual audio, sets the open-source baseline for general ASR. Adedeji et al.[4] evaluate Whisper specifically in a clinical context and document its failure modes: drug-name mistranscription (e.g., "Toprol" → "to prove all"), dosage misalignment, and hallucinated phrases during silence. Auris does not propose ASR improvements; it documents the Adedeji failure pattern and constrains the downstream stages so that ambiguous transcriptions are not silently lifted into FHIR resources.
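For concreteness, a minimal Stage-1 sketch using the openai-whisper reference implementation[1]; the audio filename is a placeholder, and word_timestamps=True supplies the per-word timings that Stage 3 consumes:

```python
import whisper

# Load the large-v3 checkpoint used in Stage 1.
model = whisper.load_model("large-v3")

# word_timestamps=True attaches per-word start/end times, which Stage 3
# needs in order to align tokens against pyannote's speaker segments.
result = model.transcribe("encounter_0042.wav", word_timestamps=True)

for segment in result["segments"]:
    # avg_logprob and no_speech_prob are the confidence signals Auris
    # later uses to flag segments for clinician review (see § 6).
    print(segment["start"], segment["end"], segment["text"],
          segment["avg_logprob"], segment["no_speech_prob"])
```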

2.2 Speaker Diarisation

Bredin's pyannote.audio 2.1[2] established the open diarisation baseline; the 3.x line introduced the powerset multi-class cross-entropy loss of Plaquet & Bredin[3], which substantially improves performance on overlapping speech — a frequent pattern in clinician-patient dialogue. Auris uses pyannote 3.x with the powerset segmenter.
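A minimal Stage-2 sketch against the published pyannote/speaker-diarization-3.1 pipeline; the Hugging Face token and audio filename are placeholders:

```python
from pyannote.audio import Pipeline

# pyannote/speaker-diarization-3.1 bundles the powerset segmentation
# model of Plaquet & Bredin [3]; a Hugging Face access token is required.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder token
)

diarization = pipeline("encounter_0042.wav")

# Speakers come back as anonymous labels (SPEAKER_00, SPEAKER_01, ...);
# Stage 3 maps them onto clinician / patient / other roles.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s  {speaker}")
```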

2.3 Structured Output via Constrained Decoding

FHIR validity is a hard requirement: an extracted Observation that fails its profile is unusable. Free-form LLM generation cannot guarantee this, however carefully the model is prompted. Willard & Louf's Outlines[5] introduced the finite-state-machine approach to constrained generation; Geng et al.[6] generalised it to grammar-constrained decoding without fine-tuning. Their 2025 JSONSchemaBench[7] benchmarks the major implementations and quantifies the tradeoff between structural validity and answer quality. Auris uses JSON-Schema-constrained decoding against the FHIR R4 resource schemas, with the schemas as the source of truth.
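Auris's extraction stage runs on Claude (Table 1), but the mechanism is easiest to show with the open-source Outlines library[5] on a local model. The schema below is a deliberately pruned Observation sketch rather than the full R4 definition, and the model name is an arbitrary stand-in:

```python
import outlines

# Pruned sketch of the FHIR R4 Observation schema; the real pipeline
# constrains against the full profile definitions.
observation_schema = """{
  "type": "object",
  "properties": {
    "resourceType": {"enum": ["Observation"]},
    "status": {"enum": ["final", "preliminary"]},
    "code": {
      "type": "object",
      "properties": {"text": {"type": "string"}},
      "required": ["text"]
    },
    "valueQuantity": {
      "type": "object",
      "properties": {"value": {"type": "number"}, "unit": {"type": "string"}},
      "required": ["value", "unit"]
    }
  },
  "required": ["resourceType", "status", "code"]
}"""

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, observation_schema)

# Decoding is masked so every sampled token keeps the output inside the
# schema's language: the result always parses and always conforms.
obs = generator("Extract the blood pressure reading: 'BP today is 142 over 91.'")
```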

2.4 SOAP and Note Generation

Krishna et al.[8] (ACL 2021) introduced the canonical formulation of conversation-to-SOAP-note generation as a modular summarisation task. MTS-Dialog[9] (EACL 2023) released 1,700 doctor-patient conversation-note pairs with back-translation augmentation; Auris's synthetic dialogue release follows the MTS-Dialog format closely so that the two corpora can be combined.

2.5 Clinical Information Extraction

The n2c2 shared tasks define the field. The 2018 adverse-drug-event and medication-extraction track (Henry et al.[11]) established the benchmark format. The 2022 contextualised medication-event track (Mahajan et al.[12]) added the context dimensions — dose change, frequency change, route change — that map directly onto elements of the FHIR MedicationRequest resource. Auris evaluates its extracted MedicationRequest resources against the n2c2 medication metric for direct comparability.
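To make that correspondence concrete, a sketch of the label-to-element mapping. The R4 element paths are real; the n2c2 label names are paraphrased from [12], and the exact mapping the harness will use is an assumption:

```python
# Sketch: n2c2-2022 context dimensions → FHIR R4 MedicationRequest elements.
# Element paths are real R4; label names paraphrase [12]; the mapping
# itself is an assumption pending the released harness.
N2C2_TO_FHIR = {
    "disposition":      "MedicationRequest.status",  # start / stop / active
    "dosage_change":    "MedicationRequest.dosageInstruction.doseAndRate.doseQuantity",
    "frequency_change": "MedicationRequest.dosageInstruction.timing",
    "route_change":     "MedicationRequest.dosageInstruction.route",
}
```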

2.6 Closest Prior Art

Infherno (Frei et al., 2025)[10] describes an end-to-end LLM-agent pipeline from free-text clinical notes to FHIR resources with terminology grounding. Auris extends Infherno along three axes: (i) audio as the input modality, with diarisation; (ii) constrained-decoding-based structural guarantee rather than post-hoc validation; (iii) a publicly released evaluation dataset.

§ 3 Proposed Approach

3.1 Pipeline

Table 1. Auris pipeline stages and component versions.

Stage | Component | Output
1. Transcription | Whisper-large-v3[1] | Raw text with word-level timestamps.
2. Diarisation | pyannote 3.x w/ powerset loss[3] | Speaker-labelled segments (clinician / patient / other).
3. Reconciliation | Token-aligned merge | Diarised transcript with role tags.
4. Resource extraction | Claude Opus 4.7 + JSON-Schema constrained decoding[7] | FHIR resources (Observation, Condition, MedicationRequest, AllergyIntolerance).
5. Validation | Official HL7 FHIR Validator (US-Core profiles) | Profile-conformant resources or structured errors.
6. Note rendering | SOAP-format template from resources | Narrative note for clinician review.

Note that the SOAP narrative is derived from the FHIR resources, not generated separately. This inverts the prevailing scribe architecture and guarantees that the note and the structured resources stay consistent.
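A minimal sketch of Stage 3's token-aligned merge: each Whisper word is assigned to the diarisation turn containing its midpoint. The input shapes follow the Stage-1 and Stage-2 sketches above; the midpoint heuristic and the UNASSIGNED fallback are our assumptions, not a settled design:

```python
def merge_words_with_speakers(words, turns):
    """Stage 3 sketch: attach a speaker label to each Whisper word.

    words: [{"word": str, "start": float, "end": float}, ...]  (Stage 1)
    turns: [(start, end, speaker), ...]                         (Stage 2)
    """
    merged = []
    for w in words:
        midpoint = (w["start"] + w["end"]) / 2
        # Assign the word to the diarisation turn containing its midpoint;
        # words falling in no turn (cross-talk gaps) are tagged for review.
        speaker = next(
            (spk for start, end, spk in turns if start <= midpoint <= end),
            "UNASSIGNED",
        )
        merged.append({**w, "speaker": speaker})
    return merged
```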

Figure 1 · Audio-to-FHIR pipeline
[Diagram: six stages — (1) Whisper large-v3 ASR, (2) pyannote 3.x powerset diarisation, (3) alignment and role tagging, (4) Claude JSON-Schema extraction, (5) HL7 FHIR Validator with US-Core profiles, (6) SOAP render from FHIR — taking clinician-patient encounter audio in and emitting validated Observation, Condition, and MedicationRequest resources out.]
Figure 1. The six-stage audio-to-FHIR pipeline. Note the inversion of the prevailing scribe architecture: the SOAP narrative is rendered from validated FHIR resources rather than parsed back into structure from a generated note — guaranteeing the structured and narrative outputs cannot disagree. The Whisper-hallucination annotation references Koenecke et al.'s FAccT 2024 audit[14], which found that ~1% of Whisper transcripts contain entirely hallucinated phrases, 38% of those including explicit harms; Auris compensates by flagging low-confidence segments for clinician review rather than auto-promoting them through Stage 4.
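Stage 5 shells out to the official HL7 validator CLI. A sketch of the gate, assuming validator_cli.jar is available locally; -version and -ig are real validator flags, while the wrapper, the exit-code convention, and the file handling are our assumptions:

```python
import json
import subprocess
import tempfile

def validate_resource(resource: dict) -> bool:
    """Stage 5 sketch: hard validation gate via the official HL7 validator.

    Returns True only if the resource conforms to US Core under FHIR R4;
    non-conformant resources are routed back as structured errors and are
    never surfaced to the clinician.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(resource, f)
        path = f.name
    result = subprocess.run(
        ["java", "-jar", "validator_cli.jar", path,
         "-version", "4.0", "-ig", "hl7.fhir.us.core"],
        capture_output=True, text=True,
    )
    return result.returncode == 0
```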

3.2 Released Dataset

A 100-dialogue synthetic corpus is generated by simulated clinician-patient pairs using Claude as the simulator, conditioned on Synthea-generated patient backgrounds. Each dialogue is paired with gold FHIR resources hand-curated by the author. Dialogues span: outpatient new-patient visit (25), outpatient follow-up (25), ED triage (20), telehealth (15), and home health (15). Audio is generated by a high-quality TTS system (e.g., ElevenLabs) with two distinct voices and naturalistic backchannels. The dialogue size and split structure mirror ACI-Bench[16] (Yim et al., Nature Scientific Data 2023) — the canonical 207-conversation + clinical-note-pair ambient-clinical-intelligence benchmark — so a v0.2 Auris release can benchmark directly against ACI-Bench's published numbers. PriMock57's[17] 57 mock primary-care consultations (~9 hours of audio across seven clinicians) provide additional cross-dataset validation of ASR-on-medical-dialogue performance.
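The per-dialogue record layout is not yet frozen; the following is a hypothetical sketch of one corpus entry, with every field name a placeholder rather than a committed format:

```python
# Hypothetical shape of one corpus entry (all field names are placeholders;
# gold resources are hand-curated FHIR R4 JSON, per § 3.2).
record = {
    "dialogue_id": "auris-0042",
    "setting": "outpatient_follow_up",            # one of the five scenarios
    "audio": "audio/auris-0042.wav",              # TTS-rendered, two voices
    "transcript": "transcripts/auris-0042.json",  # gold diarised transcript
    "gold_fhir": [                                # hand-curated gold resources
        {"resourceType": "Observation", "...": "..."},
        {"resourceType": "MedicationRequest", "...": "..."},
    ],
}
```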

§ 4 Evaluation Protocol

4.1 Metrics

Table 2. Auris evaluation metrics.

Metric | Definition | Target
Word Error Rate (WER) | Whisper-only metric, characterising the failure surface per Adedeji[4]. | Report only
Diarisation Error Rate (DER) | Standard pyannote metric on the 100-dialogue corpus. | < 10%
Resource F1 | Per-resource-type F1: a predicted resource matches gold if its code + value + subject all match. | ≥ 0.80
Validation pass rate | Fraction of predicted resources that pass the HL7 FHIR Validator against US-Core profiles. | ≥ 0.98
Medication-context F1 | n2c2 2022 contextualised metric on MedicationRequest predictions[12]. | ≥ 0.75

Pass criterion: Auris v0.1 succeeds when overall Resource F1 ≥ 0.80 on the held-out evaluation split, with no resource type below F1 0.65 (an Observation F1 of 0.85 paired with a Condition F1 of 0.40 would fail the criterion despite a high average).
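A sketch of the Resource F1 matcher implied by Table 2, treating (resourceType, code, value, subject) as the match key; the normalisation shown is an assumption, and the released harness will pin the exact key definition:

```python
import json

def resource_key(r: dict) -> tuple:
    """Match key per Table 2: resource type + code + value + subject."""
    code = (r.get("code", {}).get("coding") or [{}])[0].get("code")
    value = json.dumps(r.get("valueQuantity") or r.get("valueCodeableConcept"),
                       sort_keys=True)
    subject = r.get("subject", {}).get("reference")
    return (r.get("resourceType"), code, value, subject)

def resource_f1(predicted: list, gold: list) -> float:
    """Set-based F1: a prediction is a true positive iff its key is in gold."""
    pred_keys = {resource_key(r) for r in predicted}
    gold_keys = {resource_key(r) for r in gold}
    tp = len(pred_keys & gold_keys)
    precision = tp / len(pred_keys) if pred_keys else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```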

4.2 Baselines

Auris is compared against (i) Whisper → unstructured note → post-hoc FHIR extraction (the production-scribe pattern), (ii) Whisper → Claude direct generation without constrained decoding (no validation gate), and (iii) Infherno[10] reproduced on the same corpus. Baseline (i) is expected to be competitive on free-text fields and poor on structural fidelity; (ii) is expected to score high F1 but a low validation pass rate; (iii) is the cleanest head-to-head.

§ 5 Expected Contributions

  1. System. An open-source end-to-end audio-to-FHIR pipeline with constrained-decoding-based structural guarantee.
  2. Dataset. 100 synthetic clinician-patient dialogues with audio and gold FHIR — released under Creative Commons.
  3. Empirical findings. Quantitative characterisation of Whisper failure modes on clinical speech and the F1 cost of imposing FHIR profile validation as a hard gate.

§ 6 Limitations and Risks

Synthetic dialogues capture the structural skeleton of clinician-patient conversation but understate three real-world phenomena: code-switching (clinical jargon mid-conversation), patient-side disfluency, and environmental noise. Real-deployment evaluation on a partnering institution's ambient corpus under IRB is the necessary follow-on. Adedeji et al.[4] also document Whisper's tendency to hallucinate during silence — a failure mode that constrained decoding does not address because it operates at a later stage; Auris compensates by flagging low-confidence transcripts for clinician review rather than auto-promoting them.
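A sketch of that compensation, using Whisper's per-segment confidence signals; the thresholds are illustrative and untuned:

```python
# Illustrative, untuned thresholds: segments failing either check are
# routed to clinician review instead of being promoted to Stage 4.
AVG_LOGPROB_FLOOR = -1.0
NO_SPEECH_CEILING = 0.6

def needs_review(segment: dict) -> bool:
    """Flag Whisper segments whose confidence signals suggest hallucination
    or silence-filling, per the Adedeji [4] / Koenecke [14] failure modes."""
    return (segment["avg_logprob"] < AVG_LOGPROB_FLOOR
            or segment["no_speech_prob"] > NO_SPEECH_CEILING)
```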

A separate concern is voice biometric leakage: even synthetic audio could in principle be misused. The released dataset uses TTS voices with no biometric correspondence to real individuals; this is documented in the data card.

§ 7 Conclusion

Auris reframes the ambient scribe as a FHIR-first system rather than a note-first one. Every component is off-the-shelf, every choice is anchored in the literature, and every output is structurally validated before surfacing. The combination is the simplest expression of an architectural claim worth testing: that the right output of an ambient AI is not a paragraph but a set of write-ready clinical resources, with the note as a derived view.

References

  1. Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). ICML, 2023. arxiv.org/abs/2212.04356
  2. Bredin H. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. Interspeech, 2023. isca-archive.org/interspeech_2023/bredin23_interspeech.html
  3. Plaquet A, Bredin H. Powerset multi-class cross entropy loss for neural speaker diarization. Interspeech, 2023. arxiv.org/abs/2310.13025
  4. Adedeji A, et al. Evaluating ASR in a Clinical Context: What Whisper Misses. ICNLSP, 2025. aclanthology.org/2025.icnlsp-1.36
  5. Willard BT, Louf R. Efficient Guided Generation for Large Language Models. 2023. arxiv.org/abs/2307.09702
  6. Geng S, Josifoski M, Peyrard M, West R. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. EMNLP, 2023. arxiv.org/abs/2305.13971
  7. Geng S, et al. JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models. 2025. arxiv.org/abs/2501.10868
  8. Krishna K, Khosla S, Bigham JP, Lipton ZC. Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques. ACL, 2021. arxiv.org/abs/2005.01795
  9. Ben Abacha A, Yim W, Fan Y, Lin T. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters (MTS-Dialog). EACL, 2023. aclanthology.org/2023.eacl-main.168
  10. Frei J, Feldhus N, Raithel L, Roller R, Meyer S, Kramer F. Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes. arXiv preprint, 2025. arxiv.org/abs/2507.12261
  11. Henry S, Buchan K, Filannino M, Stubbs A, Uzuner Ö. 2018 n2c2 Shared Task on Adverse Drug Events and Medication Extraction in Electronic Health Records. JAMIA, 2020. pmc.ncbi.nlm.nih.gov/articles/PMC7489085
  12. Mahajan D, et al. Overview of the 2022 n2c2 Shared Task on Contextualized Medication Event Extraction in Clinical Notes. Journal of Biomedical Informatics, 2023. pmc.ncbi.nlm.nih.gov/articles/PMC10529825
  13. Tierney AA, et al. A Pragmatic Randomized Controlled Trial of Ambient Artificial Intelligence to Improve Health Practitioner Well-Being. NEJM AI, 2025. ai.nejm.org/doi/abs/10.1056/AIoa2500945
  14. Koenecke A, Choi AS, Mei KX, Schellmann H, Sloane M. Careless Whisper: Speech-to-Text Hallucination Harms. ACM FAccT, 2024. Audits Whisper transcripts and reports ~1% contain entirely hallucinated phrases, with 38% of those including explicit harms (violence, false authority, made-up associations). arxiv.org/abs/2402.08021
  15. Lukac M, Turner A, Vangala S, et al. Ambient AI Scribes in Clinical Practice: A Randomized Trial. NEJM AI, 2025. N=238 physicians, 14 specialties; Nabla cut time-in-note by 9.5% (95% CI −17.2 to −1.8, p=0.02); DAX −1.7% (n.s.); Mini-Z burnout +2.83 (DAX) / +2.69 (Nabla); no pajama-time reduction. ai.nejm.org/doi/abs/10.1056/AIoa2501000
  16. Yim WW, Fu Y, Ben Abacha A, Snider N, Lin T, Yetisgen-Yildiz M. ACI-Bench: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific Data (Nature), 2023. 207 doctor-patient conversation + clinical note pairs; the canonical ambient-clinical-intelligence visit-note benchmark. nature.com/articles/s41597-023-02487-3
  17. Korfiatis A, Moramarco F, Sarac R, Savkov A. PriMock57: A Dataset of Primary Care Mock Consultations. ACL, 2022. 57 mock primary-care consultations, 7 clinicians, audio + utterance-level transcripts + clinician notes; ~9 hours of audio — the standard ASR-for-medical-dialogue cross-benchmark. aclanthology.org/2022.acl-short.65
  18. Gandhi S, von Platen P, Rush A. Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. arXiv 2311.00430, 2023. 5.8× faster, 51% fewer parameters, within 1% WER on out-of-distribution test data; less prone to long-form hallucinations than parent Whisper — an inference-cost optimisation lever for Auris deployment. arxiv.org/abs/2311.00430
  19. Dong K, Ruan T, et al. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. arXiv 2411.15100, 2024. Up to 100× speedup over prior structured-decoding solutions, <40 µs/token overhead; default backend in vLLM, SGLang, TensorRT-LLM. arxiv.org/abs/2411.15100