Chartwalker: A Computer-Use Agent and Execution-Graded Benchmark for Electronic Health Record Operation
Twenty clinical workflows. A sandboxed OpenEMR. A Playwright grader that reads the database to score success. The first public computer-use benchmark for an actual EHR.
Abstract Arndt et al.[11] reported in Annals of Family Medicine (2017) that primary-care physicians spend approximately 5.9 hours per clinic day inside the EHR, including roughly 86 minutes of after-hours "pajama time". Sinsky et al.[12] earlier quantified the broader pattern: roughly two hours of EHR work for every hour of direct patient care. Computer-use agents are now technically capable of operating real desktop and web applications, as demonstrated by Anthropic's Computer Use release[1] and a sequence of GUI-agent benchmarks[2][3][4][5]. Yet no public computer-use agent or benchmark operates an actual EHR. Chartwalker fills this gap with a sandboxed OpenEMR instance, a 20-task clinical workflow suite, and a Playwright-based grading harness that reads database state to assert success, following the execution-graded pattern of WebArena[2] and OSWorld[4]. The pass criterion for the Claude Computer Use baseline is ≥ 70% task completion.
§ 1 Introduction
EHR navigation is a clinically meaningful task class. The Arndt and Sinsky findings[11][12] establish the size of the prize — every clinician hour saved on EHR mechanics is an hour returned to patient care or to clinician well-being. Yet despite the obvious importance, public computer-use agent work has focused on consumer web tasks (Mind2Web[7], WebVoyager[5], VisualWebArena[3]) and desktop apps (OSWorld[4]). No benchmark covers EHR operation, and no agent has been shown to complete clinical workflows end-to-end against a real EHR interface.
Chartwalker is the simplest version of that missing artifact. It is a benchmark, not a model. The architecture decisions are deliberate: a sandboxed OpenEMR runs in Docker, twenty workflows are defined with deterministic post-conditions in the EHR's MySQL/MariaDB backend, and the Playwright-based grader asserts those post-conditions independently of the agent's claimed trajectory.
1.1 Contributions
- A sandboxed OpenEMR evaluation harness with reproducible setup and seeded patient data.
- A 20-task benchmark suite covering order entry, chart review, refill workflow, problem-list reconciliation, and result acknowledgment — each task with a deterministic post-condition gold.
- A public leaderboard with baseline results from Claude Computer Use[1], evaluated with per-task completion, per-step accuracy, and grounded-failure-mode classification.
§ 2 Background and Related Work
2.1 Computer-Use Agents
Anthropic's Computer Use release[1] (October 2024) introduced a screenshot-plus-cursor action loop in which the model receives the screen as an image and emits mouse and keyboard actions. Chartwalker is built on this loop. The agent is granted only screen access — it does not query the OpenEMR API directly — because the explicit point of the evaluation is to measure UI-mediated agency, not API-mediated agency.
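A minimal sketch of that loop, using the Anthropic SDK's computer-use beta (October 2024 identifiers). `execute_action` and `take_screenshot` are harness stubs, not SDK calls:

```python
# Screenshot-plus-cursor action loop against the Anthropic computer-use beta.
import anthropic

client = anthropic.Anthropic()
TOOLS = [{
    "type": "computer_20241022",   # beta computer-use tool identifier
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
}]

def execute_action(action: dict) -> None:
    """Harness stub: translate a model action into sandbox mouse/keyboard calls."""
    raise NotImplementedError

def take_screenshot() -> str:
    """Harness stub: return a base64-encoded, SoM-overlaid PNG of the sandbox."""
    raise NotImplementedError

def run_task(task_description: str, max_steps: int = 40) -> list:
    messages = [{"role": "user", "content": task_description}]
    for _ in range(max_steps):
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        messages.append({"role": "assistant", "content": response.content})
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        if not tool_uses:          # the model stopped issuing actions
            break
        results = []
        for block in tool_uses:
            execute_action(block.input)   # mouse/keyboard action in the sandbox
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image", "source": {
                    "type": "base64", "media_type": "image/png",
                    "data": take_screenshot(),
                }}],
            })
        messages.append({"role": "user", "content": results})
    return messages
```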
2.2 Execution-Graded Web Benchmarks
WebArena[2] (ICLR 2024) established the methodology Chartwalker inherits: sandboxed full-stack web applications with execution-based graders that read backend state after the agent finishes. VisualWebArena[3] extended the framework to multimodal grounding and popularised Set-of-Mark prompting[9]. OSWorld[4] generalised execution grading to full desktop environments. WebVoyager[5] demonstrated end-to-end multimodal agents on live websites; SeeAct[6] decomposed web agency into action generation plus action grounding, a separation that informs our Chartwalker prompt structure.
2.3 Visual Grounding
Set-of-Mark[9] overlays numbered marks on UI widgets so the model can reference elements by ID rather than by raw coordinates. Ferret-UI[10] demonstrates that UI-specific grounding on high-aspect-ratio screens outperforms generic vision-language models. Chartwalker overlays SoM marks on every screenshot before sending it to the agent, an inexpensive intervention with well-documented benefits.
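A minimal sketch of the overlay step, assuming Playwright's sync API and a simple CSS notion of "interactive element" (the selector below is an assumption, not the benchmark's exact heuristic):

```python
# Set-of-Mark overlay: number every interactive element in the screenshot so
# the model can say "click [7]" instead of guessing pixel coordinates.
from io import BytesIO
from PIL import Image, ImageDraw
from playwright.sync_api import Page

INTERACTIVE = "a, button, input, select, textarea, [role='button']"

def som_screenshot(page: Page) -> tuple[bytes, dict[int, dict]]:
    """Return a marked-up PNG plus a mark-id -> bounding-box map."""
    marks: dict[int, dict] = {}
    img = Image.open(BytesIO(page.screenshot())).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, el in enumerate(page.query_selector_all(INTERACTIVE), start=1):
        box = el.bounding_box()
        if box is None:            # detached or invisible element
            continue
        marks[i] = box
        draw.rectangle(
            [box["x"], box["y"], box["x"] + box["width"], box["y"] + box["height"]],
            outline="red", width=2,
        )
        draw.text((box["x"] + 2, box["y"] + 2), str(i), fill="red")
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue(), marks
```

The id-to-box map is what lets the harness translate "click [7]" back into a concrete cursor action.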
2.4 Generalist Web Agency
Mind2Web[7] evaluates agents across 137 websites with cross-domain generalisation as the central concern. Chartwalker inherits the cross-domain framing: while the v0.1 benchmark targets OpenEMR specifically, the task taxonomy is portable to other EHRs (OpenMRS, GNU Health), and a v0.2 effort should add at least one additional EHR for true generalisation testing.
2.5 Why an EHR-Specific Benchmark
Existing GUI-agent benchmarks systematically underrepresent the patterns that dominate EHR work: time-bounded ordering, structured-form entry with terminology binding, problem-list reconciliation, refill workflows with administrative steps. AgentBench[8] demonstrated that LLM agent performance varies dramatically by task category — aggregate scores conceal large per-category spreads. Chartwalker's per-category reporting is designed to surface the categories where current agents fail, which is the actionable result for downstream agent designers.
§ 3 Proposed Approach
3.1 Sandbox
OpenEMR (latest stable release) runs in Docker with MariaDB as the backend. A seed script populates the sandbox with 50 synthetic patients drawn from Synthea (see Atrium, Project 01), each with a realistic baseline of conditions, medications, allergies, and recent encounters. The sandbox is reset between every task evaluation so that no task can be confounded by state from a previous task.
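A minimal sketch of the per-task reset, assuming a Docker Compose file and a harness-side seed script (both names illustrative):

```python
# Per-task sandbox reset: tear down the OpenEMR + MariaDB containers, recreate
# them, and re-run the Synthea seed so every task starts from identical state.
import subprocess

def reset_sandbox(compose_file: str = "docker-compose.yml") -> None:
    base = ["docker", "compose", "-f", compose_file]
    subprocess.run(base + ["down", "--volumes"], check=True)      # drop all state
    subprocess.run(base + ["up", "--detach", "--wait"], check=True)  # wait for healthchecks
    # seed_patients.py is a harness script, not part of OpenEMR.
    subprocess.run(["python", "seed_patients.py", "--count", "50"], check=True)
```

A faster variant restores a MariaDB dump of the seeded state instead of recreating the containers; the full teardown is the conservative default.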
3.2 The 20-Task Suite
| Category | Tasks | Example |
|---|---|---|
| Order entry | 5 | Order metformin 500 mg BID for patient P0001. |
| Chart review | 4 | Identify the patient's most recent HbA1c value and add a note acknowledging it. |
| Refill workflow | 3 | Refill atorvastatin for 90 days with two refills; cancel an expired prescription. |
| Problem-list reconciliation | 4 | Mark "Type 2 diabetes — uncontrolled" as resolved; add "Type 2 diabetes — controlled". |
| Result acknowledgment | 4 | Acknowledge a returned BMP, flag potassium as abnormal, message the patient. |
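Each row in the table expands into a machine-readable task specification. A sketch of its shape (field names are ours, not an OpenEMR construct; the SQL placeholder is illustrative):

```python
# Illustrative task specification: one natural-language instruction plus the
# SQL assertions the grader will run after the agent finishes.
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    category: str                 # e.g. "order_entry"
    instruction: str              # shown verbatim to the agent
    postconditions: list[str] = field(default_factory=list)  # SQL; each must return a row

EXAMPLE = Task(
    task_id="order-entry-01",
    category="order_entry",
    instruction="Order metformin 500 mg BID for patient P0001.",
    postconditions=[
        "SELECT 1 FROM prescriptions WHERE drug LIKE '%metformin%' /* ... */",
    ],
)
```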
3.3 Execution-Based Grading
Each task has a pre-defined post-condition: a set of assertions over the OpenEMR database that must hold after the agent finishes. For example, "Order metformin 500 mg BID for P0001" asserts that a matching prescription row exists with the right patient, drug, and dose. A minimal sketch of that assertion set follows; the table and column names are assumptions, not verified against the OpenEMR schema:
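```python
# Illustrative post-conditions for "Order metformin 500 mg BID for P0001".
# Table and column names are assumed (the shipped grader binds to the actual
# OpenEMR schema); each query must return at least one row to pass.
POSTCONDITIONS = [
    # 1. A prescription row exists for the right patient and drug.
    """SELECT 1 FROM prescriptions p
         JOIN patient_data pd ON pd.pid = p.patient_id
        WHERE pd.pubpid = 'P0001' AND p.drug LIKE '%metformin%'""",
    # 2. The dose on that row matches the order.
    """SELECT 1 FROM prescriptions p
         JOIN patient_data pd ON pd.pid = p.patient_id
        WHERE pd.pubpid = 'P0001' AND p.drug LIKE '%metformin%'
          AND p.dosage LIKE '%500%'""",
]
```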
The grading harness drives the agent's browser session through Playwright, then connects to the MariaDB backend and asserts each post-condition. The task verdict is binary; partial credit is reported separately as the fraction of post-conditions met.
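A minimal grading sketch, using `pymysql` as a stand-in for whatever MySQL client the harness ships with (connection parameters illustrative):

```python
# Execution-based grading: run each post-condition against the sandbox
# database after the agent's session ends.
import pymysql

def grade(postconditions: list[str]) -> tuple[bool, float]:
    """Return (binary verdict, fraction of post-conditions met)."""
    conn = pymysql.connect(host="localhost", user="openemr",
                           password="openemr", database="openemr")
    try:
        passed = 0
        with conn.cursor() as cur:
            for sql in postconditions:
                cur.execute(sql)
                if cur.fetchone() is not None:
                    passed += 1
        return passed == len(postconditions), passed / len(postconditions)
    finally:
        conn.close()
```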
3.4 Agent Prompting
The agent receives: (i) the task description in natural language, (ii) the initial OpenEMR screenshot with Set-of-Mark[9] overlay, (iii) a short, fixed system prompt describing OpenEMR's navigation idioms. No additional task-specific scaffolding is provided. The architecture is intentionally bare so that the benchmark measures the agent's intrinsic capability rather than the operator's ability to engineer a workflow-specific prompt.
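For concreteness, a sketch of what the fixed system prompt might contain. The wording below is illustrative, including the OpenEMR navigation hints; the released benchmark pins an exact string so every evaluated agent sees identical scaffolding:

```python
# Illustrative fixed system prompt (exact wording is pinned in the release).
SYSTEM_PROMPT = """\
You are operating OpenEMR through its web interface.
- Find patients through the patient finder in the top navigation bar.
- Orders, prescriptions, and problem lists live under the patient's chart tabs.
- Screenshot elements carry numbered red marks; refer to them as [N].
- Declare the task complete only after the EHR confirms the change on screen.
"""
```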
§ 4 Evaluation Protocol
4.1 Metrics
| Metric | Definition | Reporting |
|---|---|---|
| Task completion rate | Binary: all post-conditions met. | Overall + per category. |
| Partial credit | Mean fraction of post-conditions met. | Per category. |
| Step efficiency | Mean actions per completed task. | Per category. |
| Failure-mode classification | One of: misclick, terminology-binding error, navigation loss, premature termination. | Distribution across failed tasks. |
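A minimal aggregation sketch over a run's results; the `TaskResult` record is the harness's own format (illustrative), not a standard API:

```python
# Aggregate the four metrics in the table above from per-task results.
from collections import Counter
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    category: str
    completed: bool           # all post-conditions met
    partial: float            # fraction of post-conditions met
    steps: int                # actions emitted by the agent
    failure_mode: str | None  # None when completed

def report(results: list[TaskResult]) -> dict:
    done = [r for r in results if r.completed]
    return {
        "task_completion_rate": len(done) / len(results),
        "partial_credit": mean(r.partial for r in results),
        "step_efficiency": mean(r.steps for r in done) if done else None,
        "failure_modes": Counter(r.failure_mode for r in results if not r.completed),
    }
```

The same function runs per category to produce the per-category columns.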
4.2 Failure-Mode Audit
Every failed task is classified into one of four primary failure modes, drawn from the GUI-agent literature. Misclick: action grounded to the wrong widget, often a near-neighbour in the SoM overlay. Terminology-binding error: the model selected a clinically incorrect code (e.g., metformin extended-release versus immediate-release). Navigation loss: the model entered a screen from which it could not return. Premature termination: the model declared the task complete before all post-conditions held. Reporting this distribution is the practically useful output for downstream agent designers.
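To keep leaderboard submissions comparable, the four modes form a closed vocabulary; a sketch (labels are ours):

```python
# Closed failure-mode vocabulary for Section 4.2 classification.
from enum import Enum

class FailureMode(Enum):
    MISCLICK = "misclick"                            # grounded to the wrong widget
    TERMINOLOGY_BINDING = "terminology_binding"      # clinically incorrect code selected
    NAVIGATION_LOSS = "navigation_loss"              # stuck on an unrecoverable screen
    PREMATURE_TERMINATION = "premature_termination"  # declared done before post-conditions held
```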
§ 5 Expected Contributions
- Benchmark. The first public, reproducible computer-use benchmark for an actual EHR.
- Failure-mode taxonomy. A classified failure-mode distribution from a frontier computer-use agent on clinical workflows.
- Leaderboard. Initial cross-agent results (Claude Computer Use and any subsequent computer-use systems) with quarterly re-evaluation.
§ 6 Limitations and Risks
OpenEMR is open-source and stylistically distinct from production EHRs like Epic and Cerner; results on Chartwalker do not directly transfer to those systems. The architectural pattern — sandbox plus post-condition grader — does, but a v0.2 effort should add at least one additional EHR target. The Set-of-Mark[9] overlay is also an artifact of evaluation that production systems would not have; agents that depend heavily on SoM marks may underperform on raw EHR screens.
A separate risk: writing prescriptions into a sandboxed EHR is safe, but an agent that can write them reliably is a precursor to systems that could write them into production EHRs. Chartwalker's release is explicitly limited to the sandbox configuration and includes guidance against repointing the agent at production systems without institutional QA, in keeping with the broader literature on clinical LLM evaluation gaps.
§ 7 Conclusion
Chartwalker takes the well-established sandbox-plus-execution-grader pattern from WebArena[2] and OSWorld[4], applies it to the place clinicians actually spend their working day, and exposes a leaderboard that surfaces the per-category failure modes that aggregate scores conceal. It is the simplest version of a benchmark that should exist and currently does not.
References
- [1] Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and a new Claude 3.5 Haiku. 2024. anthropic.com/news/3-5-models-and-computer-use
- [2] Zhou S, Xu FF, Zhu H, et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR, 2024. arxiv.org/abs/2307.13854
- [3] Koh JY, Lo R, Jang L, et al. VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks. ACL, 2024. arxiv.org/abs/2401.13649
- [4] Xie T, Zhang D, Chen J, et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS, 2024. arxiv.org/abs/2404.07972
- [5] He H, Yao W, Ma K, et al. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL, 2024. arxiv.org/abs/2401.13919
- [6] Zheng B, Gou B, Kil J, Sun H, Su Y. GPT-4V(ision) is a Generalist Web Agent, if Grounded (SeeAct). ICML, 2024. arxiv.org/abs/2401.01614
- [7] Deng X, Gu Y, Zheng B, et al. Mind2Web: Towards a Generalist Agent for the Web. NeurIPS Spotlight, 2023. arxiv.org/abs/2306.06070
- [8] Liu X, Yu H, Zhang H, et al. AgentBench: Evaluating LLMs as Agents. ICLR, 2024. arxiv.org/abs/2308.03688
- [9] Yang J, Zhang H, Li F, et al. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V. 2023. arxiv.org/abs/2310.11441
- [10] You H, Zhang H, Gan Z, et al. Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs. ECCV, 2024. arxiv.org/abs/2404.05719
- [11] Arndt BG, Beasley JW, Watkinson MD, et al. Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations. Annals of Family Medicine, 15(5):419–426, 2017. annfammed.org/content/15/5/419
- [12] Sinsky C, Colligan L, Li L, et al. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Annals of Internal Medicine, 165(11):753–760, 2016. acpjournals.org/doi/10.7326/M16-0961