Researchers at the National Institutes of Health (NIH) have found that an AI model answered medical quiz questions with high accuracy. The questions, which pair clinical images with brief text summaries, are designed to test health professionals’ diagnostic skills. Physicians noted, however, that the AI often made mistakes when describing the images and explaining its reasoning.
Published in *npj Digital Medicine*, the study was led by NIH’s National Library of Medicine (NLM) and Weill Cornell Medicine in New York City. It found that while AI shows promise in accelerating diagnoses, it still falls short of human expertise.
“AI integration in healthcare could enhance diagnostic speed and early treatment initiation,” said NLM Acting Director Stephen Sherry, Ph.D. “However, as this study indicates, AI cannot yet replace human experience, which remains essential for accurate diagnosis.”
In the study, an AI model answered 207 questions from the *New England Journal of Medicine*’s Image Challenge. For each question, it was asked to provide an image description, a summary of relevant medical knowledge, and step-by-step reasoning. Physicians from diverse specialties answered questions in both “closed-book” (no external resources) and “open-book” (with external resources) settings.
The AI model chose the correct diagnosis more often than physicians in closed-book settings, but physicians with access to external resources outperformed it, especially on the more challenging questions. Even when its final diagnosis was correct, the AI struggled with image descriptions and reasoning, sometimes misinterpreting the same condition when it was presented differently.
This research highlights the need for further evaluation of multi-modal AI before its clinical adoption. NLM Senior Investigator Zhiyong Lu, Ph.D., emphasized the potential of AI to augment clinical decision-making but stressed the importance of understanding its limitations.
The model tested was GPT-4V, a multimodal AI capable of processing both text and images. Although preliminary, the study offers insights into the potential and limitations of such models in medical decision-making. Collaborators included researchers from the University of Pittsburgh, UT Southwestern Medical Center, and Harvard Medical School, among other institutions.