Auris: End-to-End Ambient Voice-to-FHIR Pipeline with Structured Resource Validation
From clinician-patient audio to discrete, write-ready Observation, Condition, and MedicationRequest resources — validated against FHIR R4 profiles before they ever surface.
Abstract

Ambient AI scribing has become the headline product category for clinical AI in 2026, validated in recent NEJM AI work — Tierney et al.'s 2025 pragmatic RCT on ambient AI scribe well-being[13] and Lukac et al.'s 2025 head-to-head trial of DAX Copilot vs Nabla (N=238, 14 specialties) showing Nabla reduced time-in-note by 9.5% (95% CI −17.2 to −1.8, p=0.02)[15]. Production scribes, however, predominantly emit unstructured narrative, leaving the discrete FHIR resources the EHR actually consumes to a downstream step. Auris is an end-to-end pipeline from clinician-patient audio to structured, validated FHIR R4 resources. The pipeline composes Whisper-large-v3[1] for transcription, pyannote 3.x[2][3] for diarisation, Claude with grammar-constrained JSON-Schema decoding[5][6][7] for resource extraction, and the official HL7 FHIR Validator for profile conformance. The methodology builds on the SOAP-note generation literature[8][9], the medication-event extraction work from n2c2[11][12], and the recent Infherno[10] agent-based FHIR synthesis prior art. The release target is a 100-dialogue synthetic dataset with gold FHIR; pass criterion is ≥ 80% F1 on FHIR resource extraction.
§ 1 Introduction
Two recent NEJM AI publications provide the most rigorous evidence to date that ambient scribes affect clinician workload. Tierney et al.'s pragmatic RCT[13] measured practitioner well-being outcomes from an ambient AI scribe in production. Lukac et al.'s 2025 head-to-head RCT[15] compared DAX Copilot and Nabla across 238 physicians in 14 specialties (Nov 2024 – Jan 2025): Nabla cut time-in-note by 9.5% (95% CI −17.2 to −1.8, p=0.02) while DAX showed a non-significant 1.7% reduction; both arms showed Mini-Z burnout improvements but neither reduced pajama-time / after-hours EHR use. The trial's output, however, is a narrative SOAP note — the same artifact clinical scribes have produced for decades. The EHR consumes that narrative downstream, parsing it into discrete fields with variable fidelity.
The valuable artifact — for billing, for population health, for downstream AI use — is structured FHIR. Auris targets that artifact directly, producing FHIR resources as the primary output and a derived narrative note as the secondary. The Infherno preprint[10] (2025) is the closest prior art and confirms the architecture is feasible; we extend by adding diarisation, explicit profile validation, and a released evaluation dataset.
1.1 Contributions
- An open-source, reproducible audio-to-FHIR pipeline composed of named, citation-anchored components.
- A released dataset of 100 synthetic clinician-patient dialogues with gold FHIR annotations covering Observation, Condition, MedicationRequest, and AllergyIntolerance.
- An evaluation harness measuring F1 on resource extraction with per-resource breakdowns, plus a documented characterisation of Whisper's failure modes on clinical speech, drawing on Adedeji et al.[4].
§ 2 Background and Related Work
2.1 ASR for Clinical Speech
Whisper[1] trained on 680,000 hours of weakly-supervised multilingual audio sets the open-source baseline for general ASR. Adedeji et al.[4] evaluate Whisper specifically in clinical context and document its failure modes: drug-name mistranscription (e.g., "Toprol" → "to prove all"), dosage misalignment, hallucinated phrases during silence. Auris does not propose ASR improvements; it documents the Adedeji failure pattern and confines downstream stages so that ambiguous transcriptions are not silently lifted into FHIR resources.
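One way to keep ambiguous transcriptions out of the FHIR stage is a simple confidence gate. The sketch below is illustrative rather than Auris's published mechanism: it assumes per-word probabilities are available from the ASR layer (Whisper bindings typically expose word-level confidences alongside word-level timestamps) and holds low-probability words for review instead of passing them downstream.

```python
# Hypothetical confidence gate: words whose ASR probability falls below a
# threshold are routed to review rather than silently lifted into FHIR.
# The (token, probability) pair representation abstracts over the exact
# field names, which vary across Whisper bindings.
CONF_THRESHOLD = 0.6

def flag_low_confidence(words, threshold=CONF_THRESHOLD):
    """Split a transcript into accepted text and flagged spans.

    `words` is a list of (token, probability) pairs.
    Returns (accepted_words, flagged_words).
    """
    accepted, flagged = [], []
    for token, prob in words:
        (accepted if prob >= threshold else flagged).append(token)
    return accepted, flagged

# A drug-name mistranscription typically arrives with depressed confidence,
# so it lands in the flagged list while the dosage survives.
words = [("metoprolol", 0.41), ("50", 0.97), ("milligrams", 0.95), ("daily", 0.92)]
accepted, flagged = flag_low_confidence(words)
```

The threshold value here is arbitrary; in practice it would be tuned against the WER characterisation in § 4.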
2.2 Speaker Diarisation
Bredin's pyannote.audio 2.1[2] established the open diarisation baseline; the 3.x line introduced the powerset multi-class cross-entropy loss of Plaquet & Bredin[3], which substantially improves performance on overlapping speech — a frequent pattern in clinician-patient dialogue. Auris uses pyannote 3.x with the powerset segmenter.
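For readers unfamiliar with the diarisation metric targeted in § 4, a toy frame-level version shows the arithmetic. The real metric, as implemented in pyannote, additionally solves an optimal mapping between reference and hypothesis speaker labels and applies a forgiveness collar; this sketch assumes the labels are already aligned.

```python
def frame_der(reference, hypothesis):
    """Toy frame-level diarisation error rate.

    `reference` and `hypothesis` are equal-length lists of per-frame
    speaker labels, with None meaning non-speech.
    DER = (missed speech + false alarm + speaker confusion)
          / total reference speech frames.
    """
    miss = fa = conf = 0
    ref_speech = sum(1 for r in reference if r is not None)
    for r, h in zip(reference, hypothesis):
        if r is not None and h is None:
            miss += 1          # reference speech not detected
        elif r is None and h is not None:
            fa += 1            # speech hypothesised during silence
        elif r is not None and r != h:
            conf += 1          # right activity, wrong speaker
    return (miss + fa + conf) / ref_speech

ref = ["doc", "doc", "pat", "pat", None, "pat"]
hyp = ["doc", "pat", "pat", "pat", "pat", None]
# one confusion + one false alarm + one miss over 5 speech frames -> 0.6
```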
2.3 Structured Output via Constrained Decoding
FHIR validity is a hard requirement: an extracted Observation that fails its profile is unusable. Free-form LLM generation is insufficient even for well-prompted models. Willard & Louf's Outlines[5] introduced the finite-state-machine approach to constrained generation; Geng et al.[6] generalised to grammar-constrained decoding without fine-tuning. Their 2025 JSONSchemaBench[7] benchmarks the major implementations and quantifies the tradeoff between structural validity and answer quality. Auris uses JSON-Schema-constrained decoding against the FHIR R4 resource schemas, with the schemas as the single source of truth.
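The same schema that constrains decoding can double as a post-hoc check. The sketch below is a hand-reduced, illustrative subset of the FHIR R4 Observation schema, and `check` implements only two JSON-Schema keywords (`required` and a const/type check, with Python types standing in for JSON-Schema type strings); a real pipeline would feed the official FHIR JSON Schema to a constrained-decoding engine such as Outlines[5].

```python
# Illustrative subset of the FHIR R4 Observation schema, reduced to the
# fields Auris's matching metric cares about. Python types stand in for
# JSON-Schema "type" strings to keep the sketch dependency-free.
OBSERVATION_SCHEMA = {
    "required": ["resourceType", "status", "code"],
    "properties": {
        "resourceType": {"const": "Observation"},
        "status": {"type": str},
        "code": {"type": dict},
    },
}

def check(resource, schema):
    """Return a list of violations (empty means the resource conforms)."""
    errors = [f"missing required field: {k}"
              for k in schema["required"] if k not in resource]
    for key, rule in schema["properties"].items():
        if key not in resource:
            continue
        if "const" in rule and resource[key] != rule["const"]:
            errors.append(f"{key}: expected {rule['const']!r}")
        if "type" in rule and not isinstance(resource[key], rule["type"]):
            errors.append(f"{key}: wrong type")
    return errors

obs = {"resourceType": "Observation", "status": "final",
       "code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]}}
```

With constrained decoding, `check` should never fire; keeping it as an assertion documents the invariant the decoder is supposed to guarantee.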
2.4 SOAP and Note Generation
Krishna et al.[8] (ACL 2021) introduced the canonical formulation of conversation-to-SOAP-note as a modular summarisation task. MTS-Dialog[9] (EACL 2023) released 1,700 doctor-patient conversation-note pairs with back-translation augmentation; Auris's synthetic dialogue release follows their format closely so that the data can be combined.
2.5 Clinical Information Extraction
The n2c2 shared tasks define the field. The 2018 adverse-drug-event and medication-extraction track (Henry et al.[11]) established the benchmark format. The 2022 contextualised medication-event track (Mahajan et al.[12]) added the context fields — dose change, frequency change, route change — that map directly onto FHIR MedicationRequest's elements. Auris evaluates against the n2c2 medication metric on its extracted MedicationRequest resources for direct comparability.
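The mapping from an n2c2-style medication tuple onto FHIR can be sketched directly. The element paths (`medicationCodeableConcept`, `dosageInstruction[].route`, `dosageInstruction[].timing`) are FHIR R4's documented structure, but the function itself is illustrative: terminology grounding to RxNorm/SNOMED codes is omitted, and `subject` carries a placeholder reference.

```python
def to_medication_request(drug, dose, frequency, route):
    """Map an n2c2-style (drug, dose, frequency, route) tuple onto a
    minimal FHIR R4 MedicationRequest. Illustrative only: a production
    pipeline would ground `drug` and `route` to RxNorm / SNOMED codes
    and bind `subject` to the real Patient resource.
    """
    return {
        "resourceType": "MedicationRequest",
        "status": "active",
        "intent": "order",
        "subject": {"reference": "Patient/example"},  # placeholder
        "medicationCodeableConcept": {"text": drug},
        "dosageInstruction": [{
            "text": f"{dose} {frequency} {route}",
            "route": {"text": route},
            "timing": {"code": {"text": frequency}},
        }],
    }

mr = to_medication_request("metoprolol 50 mg", "50 mg", "once daily", "oral")
```

The n2c2 2022 context fields (dose change, frequency change, route change) would land in exactly these `dosageInstruction` elements, which is what makes the direct-comparability claim workable.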
2.6 Closest Prior Art
Infherno (Frei et al., 2025)[10] describes an end-to-end LLM-agent pipeline from free-text clinical notes to FHIR resources with terminology grounding. Auris extends Infherno along three axes: (i) audio as the input modality, with diarisation; (ii) constrained-decoding-based structural guarantee rather than post-hoc validation; (iii) a publicly released evaluation dataset.
§ 3 Proposed Approach
3.1 Pipeline
| Stage | Component | Output |
|---|---|---|
| 1. Transcription | Whisper-large-v3[1] | Raw text with word-level timestamps. |
| 2. Diarisation | pyannote 3.x w/ powerset loss[3] | Speaker-labelled segments (clinician / patient / other). |
| 3. Reconciliation | Token-aligned merge | Diarised transcript with role tags. |
| 4. Resource extraction | Claude Opus 4.7 + JSON-Schema constrained decoding[7] | FHIR resources (Observation, Condition, MedicationRequest, AllergyIntolerance). |
| 5. Validation | Official HL7 FHIR Validator (US-Core profiles) | Profile-conformant resources or structured errors. |
| 6. Note rendering | SOAP-format template from resources | Narrative note for clinician review. |
Note that the SOAP narrative is derived from the FHIR resources, not generated separately. This inverts the prevailing scribe architecture and guarantees that the note and the structured resources stay consistent.
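Stage 3's token-aligned merge admits a compact sketch. Assumptions: Whisper supplies `(token, start, end)` word timings, diarisation plus a role-assignment step supplies `(start, end, role)` segments, and the midpoint rule and all names here are illustrative rather than Auris's exact implementation.

```python
def assign_roles(words, segments):
    """Stage 3 reconciliation: attach a speaker role to each word.

    `words` are (token, start, end) from word-level ASR timestamps;
    `segments` are (start, end, role) from diarisation. Each word takes
    the role of the segment covering its midpoint; words not covered by
    any segment get 'unknown' and are held for review.
    """
    tagged = []
    for token, start, end in words:
        mid = (start + end) / 2
        role = next((r for s, e, r in segments if s <= mid < e), "unknown")
        tagged.append((token, role))
    return tagged

words = [("What", 0.0, 0.2), ("brings", 0.2, 0.5),
         ("chest", 1.3, 1.6), ("pain", 1.6, 1.9)]
segments = [(0.0, 1.0, "clinician"), (1.2, 2.5, "patient")]
tagged = assign_roles(words, segments)
```

The midpoint rule is deliberately conservative: a word straddling a segment boundary is attributed to whichever speaker covers more than half of it, and genuinely ambiguous words fall through to `'unknown'` rather than being guessed.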
3.2 Released Dataset
A 100-dialogue synthetic corpus is generated by simulated clinician-patient pairs using Claude as the simulator, conditioned on Synthea-generated patient backgrounds. Each dialogue is paired with gold FHIR resources hand-curated by the author. Dialogues span: outpatient new-patient visit (25), outpatient follow-up (25), ED triage (20), telehealth (15), home-health (15). Audio is generated by a high-quality TTS (e.g., ElevenLabs) with two distinct voices and naturalistic backchannels. The dialogue size and split structure mirror ACI-Bench[16] (Yim et al., Nature Scientific Data 2023) — the canonical 207-conversation + clinical-note pair ambient-clinical-intelligence benchmark — so a v0.2 Auris release can directly benchmark against ACI-Bench's published numbers. PriMock57's[17] 57 mock primary-care consultations (~9 hours of audio across seven clinicians) provides additional cross-dataset validation for ASR-on-medical-dialogue performance.
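To make the gold annotation format concrete, a plausible entry for a single blood-pressure mention might look as follows. This is illustrative only and the released format may differ; LOINC 85354-9, 8480-6, and 8462-4 are the standard blood-pressure panel, systolic, and diastolic codes.

```python
# Hypothetical gold annotation for one dialogue ("BP was 142 over 91"),
# shaped as a dialogue id plus the FHIR resources a correct extractor
# should produce. Field layout follows FHIR R4 Observation components.
gold = {
    "dialogue_id": "auris-0001",
    "resources": [{
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": "85354-9"}]},
        "subject": {"reference": "Patient/auris-0001"},
        "component": [
            {"code": {"coding": [{"system": "http://loinc.org", "code": "8480-6"}]},
             "valueQuantity": {"value": 142, "unit": "mmHg"}},  # systolic
            {"code": {"coding": [{"system": "http://loinc.org", "code": "8462-4"}]},
             "valueQuantity": {"value": 91, "unit": "mmHg"}},   # diastolic
        ],
    }],
}
```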
§ 4 Evaluation Protocol
4.1 Metrics
| Metric | Definition | Target |
|---|---|---|
| Word Error Rate (WER) | Whisper-only metric, characterising the failure surface per Adedeji[4]. | Report only |
| Diarisation Error Rate (DER) | Standard pyannote metric on the 100-dialogue corpus. | < 10% |
| Resource F1 | Per-resource type F1: a predicted resource matches gold if its code + value + subject all match. | ≥ 0.80 |
| Validation pass rate | Fraction of predicted resources that pass HL7 FHIR Validator against US-Core profiles. | ≥ 0.98 |
| Medication-context F1 | n2c2 2022 contextualised metric on MedicationRequest predictions[12]. | ≥ 0.75 |
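The Resource F1 matching rule in the table above can be made precise in a few lines. This sketch reduces each resource to a (code, value, subject) triple and uses greedy one-to-one matching; extracting the triples from full FHIR JSON and any code-system normalisation are elided, and the example codes are illustrative.

```python
def resource_f1(predicted, gold):
    """Resource-extraction F1 under the matching rule above: a prediction
    is a true positive iff its (code, value, subject) triple equals a
    still-unmatched gold triple (one-to-one matching, so duplicate
    predictions are not double-counted).
    """
    unmatched = list(gold)
    tp = 0
    for p in predicted:
        if p in unmatched:
            unmatched.remove(p)
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold_triples = [("8480-6", "142 mmHg", "Patient/1"),
                ("44054006", None, "Patient/1")]
pred_triples = [("8480-6", "142 mmHg", "Patient/1"),
                ("38341003", None, "Patient/1")]
# one of two predictions matches, one of two golds is found:
# precision = recall = 0.5, so F1 = 0.5
```

The per-resource-type breakdown reported in § 4 is this computation run separately over Observation, Condition, MedicationRequest, and AllergyIntolerance.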
4.2 Baselines
Auris is compared against (i) Whisper → unstructured note → post-hoc FHIR extraction (the production-scribe pattern), (ii) Whisper → Claude direct generation without constrained decoding (no validation gate), and (iii) Infherno[10] reproduced on the same corpus. Baseline (i) is expected to be competitive on free-text fields and poor on structural fidelity; (ii) is expected to be high on F1 but low on validation pass rate; (iii) is the cleanest head-to-head.
§ 5 Expected Contributions
- System. An open-source end-to-end audio-to-FHIR pipeline with constrained-decoding-based structural guarantee.
- Dataset. 100 synthetic clinician-patient dialogues with audio and gold FHIR — released under Creative Commons.
- Empirical findings. Quantitative characterisation of Whisper failure modes on clinical speech and the F1 cost of imposing FHIR profile validation as a hard gate.
§ 6 Limitations and Risks
Synthetic dialogues capture the structural skeleton of clinician-patient conversation but understate three real-world phenomena: code-switching (clinical jargon mid-conversation), patient-side disfluency, and environmental noise. Real-deployment evaluation on a partnering institution's ambient corpus under IRB is the necessary follow-on. Adedeji et al.[4] also document Whisper's tendency to hallucinate during silence — a failure mode that constrained decoding does not address because it operates at a later stage; Auris compensates by flagging low-confidence transcripts for clinician review rather than auto-promoting them.
A separate concern is voice biometric leakage: even synthetic audio could in principle be misused. The released dataset uses TTS voices with no biometric correspondence to real individuals; this is documented in the data card.
§ 7 Conclusion
Auris reframes the ambient scribe as a FHIR-first system rather than a note-first one. Every component is off-the-shelf, every choice is anchored in the literature, and every output is structurally validated before surfacing. The combination is the simplest expression of an architectural claim worth testing: that the right output of an ambient AI is not a paragraph but a set of write-ready clinical resources, with the note as a derived view.
References
- [1] Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust Speech Recognition via Large-Scale Weak Supervision (Whisper). ICML, 2023. arxiv.org/abs/2212.04356
- [2] Bredin H. pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. Interspeech, 2023. isca-archive.org/interspeech_2023/bredin23_interspeech.html
- [3] Plaquet A, Bredin H. Powerset multi-class cross entropy loss for neural speaker diarization. Interspeech, 2023. arxiv.org/abs/2310.13025
- [4] Adedeji A, et al. Evaluating ASR in a Clinical Context: What Whisper Misses. ICNLSP, 2025. aclanthology.org/2025.icnlsp-1.36
- [5] Willard BT, Louf R. Efficient Guided Generation for Large Language Models. 2023. arxiv.org/abs/2307.09702
- [6] Geng S, Josifoski M, Peyrard M, West R. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning. EMNLP, 2023. arxiv.org/abs/2305.13971
- [7] Geng S, et al. JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models. 2025. arxiv.org/abs/2501.10868
- [8] Krishna K, Khosla S, Bigham JP, Lipton ZC. Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques. ACL, 2021. arxiv.org/abs/2005.01795
- [9] Ben Abacha A, Yim W, Fan Y, Lin T. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters (MTS-Dialog). EACL, 2023. aclanthology.org/2023.eacl-main.168
- [10] Frei J, Feldhus N, Raithel L, Roller R, Meyer S, Kramer F. Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes. arXiv preprint, 2025. arxiv.org/abs/2507.12261
- [11] Henry S, Buchan K, Filannino M, Stubbs A, Uzuner Ö. 2018 n2c2 Shared Task on Adverse Drug Events and Medication Extraction in Electronic Health Records. JAMIA, 2020. pmc.ncbi.nlm.nih.gov/articles/PMC7489085
- [12] Mahajan D, et al. Overview of the 2022 n2c2 Shared Task on Contextualized Medication Event Extraction in Clinical Notes. Journal of Biomedical Informatics, 2023. pmc.ncbi.nlm.nih.gov/articles/PMC10529825
- [13] Tierney AA, et al. A Pragmatic Randomized Controlled Trial of Ambient Artificial Intelligence to Improve Health Practitioner Well-Being. NEJM AI, 2025. ai.nejm.org/doi/abs/10.1056/AIoa2500945
- [14] Koenecke A, Choi AS, Mei KX, Schellmann H, Sloane M. Careless Whisper: Speech-to-Text Hallucination Harms. ACM FAccT, 2024. Audits Whisper transcripts and reports ~1% contain entirely hallucinated phrases, with 38% of those including explicit harms (violence, false authority, made-up associations). arxiv.org/abs/2402.08021
- [15] Lukac M, Turner A, Vangala S, et al. Ambient AI Scribes in Clinical Practice: A Randomized Trial. NEJM AI, 2025. N=238 physicians, 14 specialties; Nabla cut time-in-note by 9.5% (95% CI −17.2 to −1.8, p=0.02); DAX −1.7% (n.s.); Mini-Z burnout +2.83 (DAX) / +2.69 (Nabla); no pajama-time reduction. ai.nejm.org/doi/abs/10.1056/AIoa2501000
- [16] Yim WW, Fu Y, Ben Abacha A, Snider N, Lin T, Yetisgen-Yildiz M. ACI-Bench: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific Data (Nature), 2023. 207 doctor-patient conversation + clinical note pairs; the canonical ambient-clinical-intelligence visit-note benchmark. nature.com/articles/s41597-023-02487-3
- [17] Korfiatis A, Moramarco F, Sarac R, Savkov A. PriMock57: A Dataset of Primary Care Mock Consultations. ACL, 2022. 57 mock primary-care consultations, 7 clinicians, audio + utterance-level transcripts + clinician notes; ~9 hours of audio — the standard ASR-for-medical-dialogue cross-benchmark. aclanthology.org/2022.acl-short.65
- [18] Gandhi S, von Platen P, Rush A. Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. arXiv 2311.00430, 2023. 5.8× faster, 51% fewer parameters, within 1% WER on out-of-distribution test data; less prone to long-form hallucinations than parent Whisper — an inference-cost optimisation lever for Auris deployment. arxiv.org/abs/2311.00430
- [19] Dong K, Ruan T, et al. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. arXiv 2411.15100, 2024. Up to 100× speedup over prior structured-decoding solutions, <40 µs/token overhead; default backend in vLLM, SGLang, TensorRT-LLM. arxiv.org/abs/2411.15100