New Delhi, Jan 3 (PTI) With chatbots increasingly being relied on to make sense of one's symptoms or test results, a study has shown that AI tools may not fare as well in conversations that more closely resemble real-world interactions, even as they perform well on medical exam-style tests.
The study, published in the journal Nature Medicine, also proposes recommendations for evaluating large language models (LLMs) -- which power chatbots such as ChatGPT -- before they are used in clinical settings. LLMs are trained on massive text datasets and can therefore respond to a user's requests in natural language.
Researchers at Harvard Medical School and Stanford University in the US designed a framework called 'CRAFT-MD' to evaluate four LLMs, including GPT-4 and Mistral, on how well they performed in settings closely mimicking actual interactions with patients.
The framework analysed how well an LLM can collect information about symptoms, medications and family history and then make a diagnosis. The AI tools were tested on 2,000 clinical descriptions featuring conditions common in primary care and spanning 12 medical specialties.
One AI agent was made to pose as a patient, answering the tool's questions in a conversational style, while another AI agent graded the accuracy of the final diagnosis rendered by the tool under evaluation, the researchers said.
Human experts then evaluated the outcomes of each patient interaction for the LLM's ability to gather relevant patient information, its diagnostic accuracy when presented with scattered information, and its adherence to prompts.
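For readers curious how such a setup might be wired together, the sketch below illustrates the general shape of a multi-agent evaluation loop of the kind described above: one AI agent plays the patient, the model under test plays the doctor, and a grader agent checks the final diagnosis. The chat() helper, the prompts and the example case are invented placeholders for illustration, not the study's actual code.

```python
# Illustrative sketch of a multi-agent clinical-conversation evaluation loop.
# Everything here (prompts, helper names, the sample case) is hypothetical.

def chat(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion API call.

    Returns canned text so the sketch runs end to end; in practice this
    would call the relevant LLM with the system prompt and transcript.
    """
    if system_prompt.startswith("You are a patient"):
        return "I've had a cough and a low-grade fever for three days."
    if system_prompt.startswith("You are a physician"):
        # A capable doctor-model would ask follow-up questions before this.
        return "DIAGNOSIS: viral upper respiratory infection"
    return "correct"


def evaluate_case(vignette: str, true_diagnosis: str, max_turns: int = 10) -> bool:
    """Run one simulated doctor-patient conversation and grade the result."""
    patient_system = (
        "You are a patient. Answer the doctor's questions truthfully, using "
        "only the facts in this case description:\n" + vignette
    )
    doctor_system = (
        "You are a physician. Ask one question at a time to take a history. "
        "When confident, reply with 'DIAGNOSIS: <your diagnosis>'."
    )
    grader_system = (
        "You are a grader. Given a proposed and a correct diagnosis, reply "
        "only 'correct' or 'incorrect'."
    )

    transcript: list[dict] = []
    diagnosis = ""
    for _ in range(max_turns):
        # Doctor-model speaks, either asking a question or committing to a diagnosis.
        doctor_msg = chat(doctor_system, transcript)
        transcript.append({"role": "doctor", "content": doctor_msg})
        if doctor_msg.startswith("DIAGNOSIS:"):
            diagnosis = doctor_msg[len("DIAGNOSIS:"):].strip()
            break
        # Patient-agent replies using only the facts in the vignette.
        patient_msg = chat(patient_system, transcript)
        transcript.append({"role": "patient", "content": patient_msg})

    # Grader-agent compares the proposed diagnosis against the ground truth.
    verdict = chat(grader_system, [{
        "role": "user",
        "content": f"Proposed: {diagnosis}\nCorrect: {true_diagnosis}",
    }])
    return verdict.strip().lower().startswith("correct")


if __name__ == "__main__":
    case = "45-year-old with a three-day cough, fever of 38 C, no breathlessness."
    print(evaluate_case(case, "viral upper respiratory infection"))  # True
```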
All the LLMs were found to show limitations, especially in their ability to reason over and conduct clinical conversations based on information given by patients. This, in turn, compromised their ability to take medical histories and render appropriate diagnoses, the researchers said.
"Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history-taking and diagnostic accuracy," the authors wrote.
For example, the AI tools often struggled to ask the right questions to gather pertinent patient history, missed critical information during history taking, and had difficulty synthesising scattered information, the team said.
The AI tools also performed worse when engaged in back-and-forth exchanges -- as most real-world conversations are -- than when working from summarised accounts of a conversation, the researchers said.
"Our work reveals a striking paradox - while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit," senior author Pranav Rajpurkar, an assistant professor of biomedical informatics at Harvard Medical School, said.
"The dynamic nature of medical conversations -- the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms -- poses unique challenges that go far beyond answering multiple choice questions.
"When we switch from standardised tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy," Rajpurkar said.
An LLM's performance in clinical settings should be evaluated for its ability to ask the right questions and to extract the most essential information, the authors recommended.
They also advised using conversational, open-ended questions that more accurately reflect real-world doctor-patient interactions when designing, training and testing the AI tools.
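By way of illustration, the snippet below contrasts an exam-style multiple-choice prompt with the kind of open-ended, conversational opener the authors argue evaluations should favour; both prompts are invented examples rather than material from the study.

```python
# Two ways of posing the same case to a model under evaluation; both prompts
# are invented for illustration.

EXAM_STYLE_PROMPT = """A 45-year-old presents with a three-day cough and fever.
Which is the most likely diagnosis?
(A) Bacterial pneumonia  (B) Viral infection  (C) Asthma  (D) Reflux"""

CONVERSATIONAL_PROMPT = """Patient: Hi doctor, I've been coughing for a few
days and I feel feverish. What do you think is going on?"""

# A board-exam benchmark only scores the letter returned for the first prompt;
# a conversational evaluation instead has to track whether the model asks
# follow-up questions, gathers the missing history and only then commits to a
# diagnosis in response to the second.
```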