AI in the Exam Room? A Critical Look at DeepMind’s AMIE for Healthcare


The paper "Towards Conversational AI for Disease Management" (arXiv:2503.06074) introduces a new large language model (LLM)-based system for clinical disease management. It builds on the diagnostic capabilities of the earlier Articulate Medical Intelligence Explorer (AMIE) by adding reasoning over disease progression, response to therapy, and medication prescribing across multiple patient visits. AMIE uses Gemini's long-context capabilities, combining in-context retrieval with structured reasoning to align its outputs with authoritative clinical guidelines and drug formularies.

In a randomised, blinded virtual Objective Structured Clinical Examination (OSCE), AMIE was compared to 21 primary care physicians (PCPs) across 100 multi-visit case scenarios based on UK NICE and BMJ Best Practice guidelines. Specialist physicians assessed AMIE's management reasoning as non-inferior to that of PCPs. Additionally, AMIE demonstrated superior performance in the precision of treatments and investigations, as well as in aligning and grounding management plans with clinical guidelines.

To benchmark medication reasoning, the authors developed RxQA, a multiple-choice question benchmark derived from US and UK drug formularies and validated by board-certified pharmacists. Both AMIE and PCPs benefited from access to external drug information, and AMIE outperformed PCPs on the higher-difficulty questions.

The paper concludes that, while further research is needed before real-world implementation, AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management. The findings do not, however, show that AMIE is ready for clinical care.

LLM-based generative AI (GenAI) systems are not suitable for handling healthcare processes, mainly for the following reasons:

Lack of clinical accountability
GenAI models do not hold legal or ethical responsibility for clinical decisions, making it risky to rely on them without physician oversight in life-impacting situations.
Unpredictable errors and hallucinations
Even advanced GenAI systems can produce plausible-sounding but incorrect or fabricated medical information, which could mislead clinicians or harm patients.
Insufficient real-world validation
Most GenAI models are tested in simulated or controlled environments. They lack the robust, peer-reviewed evidence from diverse, real-world clinical settings that is needed to ensure safety and generalisability.
Missing a strong auditing schema
Many GenAI models function as "black boxes," offering limited transparency into their decision-making processes, which challenges trust and hinders error tracing in critical care contexts.

In conclusion, GenAI tools can engage with patients on low-value requests, but critical, high-value processes must be carried out by conversational process automation that is auditable, fully configurable by experts, and designed to fulfil patients' requests end-to-end.
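
To make "auditable and fully configurable by experts" more concrete, here is a minimal sketch, assuming a hypothetical rule-based workflow for a repeat-prescription request. Every name in it (`WORKFLOW`, `AuditLog`, `run_workflow`, the step and field names) is an illustrative assumption and not part of AMIE or any specific product. The idea is only that each step lives in a table an expert can edit, and each decision is appended to a hash-chained log so later tampering can be detected.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical, expert-editable workflow for a repeat-prescription request.
# Each step names the field it checks, the accepted values, and the next step.
WORKFLOW = {
    "start": {"field": "identity_verified", "accept": [True], "next": "check_review"},
    "check_review": {"field": "review_overdue", "accept": [False], "next": "done"},
}


@dataclass
class AuditLog:
    """Append-only log; each entry stores the hash of the previous entry."""
    entries: list = field(default_factory=list)

    def record(self, step, value, outcome):
        prev = self.entries[-1]["hash"] if self.entries else ""
        body = {"step": step, "value": value, "outcome": outcome, "prev": prev}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "hash": digest})

    def verify(self):
        """Recompute every hash to detect edits to earlier entries."""
        prev = ""
        for entry in self.entries:
            body = {k: entry[k] for k in ("step", "value", "outcome")}
            body["prev"] = prev
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != digest:
                return False
            prev = entry["hash"]
        return True


def run_workflow(request, log):
    """Walk the configured steps; escalate to a human on any failed check."""
    step = "start"
    while step != "done":
        rule = WORKFLOW[step]
        value = request.get(rule["field"])
        if value not in rule["accept"]:
            log.record(step, value, "escalate_to_clinician")
            return "escalate_to_clinician"
        log.record(step, value, "pass")
        step = rule["next"]
    log.record("done", None, "fulfilled")
    return "fulfilled"


# Example: identity is verified, but a medication review is overdue.
log = AuditLog()
result = run_workflow({"identity_verified": True, "review_overdue": True}, log)
```

In this sketch the request is escalated to a clinician rather than fulfilled, and the log records which check failed and with what value. The design choice being illustrated is that behaviour comes from explicit, reviewable rules instead of a model's free-form output, so an error can be traced to a specific step.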
