AI Takes a Leap in Healthcare: GPT-3.5 and GPT-4 Demonstrate Proficiency in Clinical Reasoning
January 29, 2024
In a recent publication in npj Digital Medicine, researchers created diagnostic reasoning prompts to assess the ability of large language models (LLMs) to simulate diagnostic clinical reasoning.
Large language models, artificial intelligence systems trained on extensive text data, have demonstrated human-like proficiency in tasks such as generating clinical notes and passing medical exams. Despite these capabilities, understanding their diagnostic reasoning skills is vital for their effective integration into clinical care.
Recent research has focused on how LLMs handle open-ended clinical questions, highlighting the potential of advanced models such as GPT-4 to work through complex patient cases. Ongoing work in prompt engineering is addressing the variability in LLM performance, which depends on the prompt and question type used.
In this study, researchers assessed the diagnostic reasoning of GPT-3.5 and GPT-4 on open-ended clinical questions. The hypothesis was that the GPT models would perform better with diagnostic reasoning prompts than with conventional chain-of-thought (CoT) prompting.
The study used the revised MedQA United States Medical Licensing Exam (USMLE) dataset and the New England Journal of Medicine (NEJM) case series. Conventional CoT prompting was compared against diagnostic reasoning prompts modeled on clinical cognitive processes: forming a differential diagnosis, analytical reasoning, Bayesian inference, and intuitive reasoning.
The primary objective was to investigate whether LLMs, guided by specialized prompts, could emulate clinical reasoning skills by combining clinical expertise with advanced prompting techniques.
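For illustration, the sketch below pairs a conventional CoT instruction with simplified stand-ins for the four diagnostic reasoning strategies the study compared. The wording here is a hypothetical approximation; the study's actual prompt text is published with the paper.

```python
# Hypothetical approximations of the prompting strategies compared in the study.
# The exact prompt wording appears in the paper's supplementary material.
PROMPT_STRATEGIES = {
    "conventional_cot": (
        "Think step by step before giving a final diagnosis."
    ),
    "differential_diagnosis": (
        "List a differential diagnosis, then explain why each alternative is more or "
        "less likely before committing to a final diagnosis."
    ),
    "analytical_reasoning": (
        "Relate the signs and symptoms to the underlying pathophysiology before "
        "giving a final diagnosis."
    ),
    "bayesian_inference": (
        "Start from a prior probability for each candidate diagnosis and update it as "
        "you consider each finding, then give the most probable final diagnosis."
    ),
    "intuitive_reasoning": (
        "Identify the pattern of findings that points most directly to a diagnosis, "
        "as an experienced clinician would, then give the final diagnosis."
    ),
}
```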
Prompt engineering played a crucial role: questions were converted into free-response format by removing the multiple-choice options, and only USMLE Step 2 and Step 3 questions focused on patient diagnosis were included. Each round of prompt engineering was evaluated by measuring GPT-3.5 accuracy on the MedQA training set, with distinct sets for training (95 questions) and testing (518 questions).
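A minimal sketch of this preprocessing step might look like the following, assuming hypothetical question, options, step, and answer fields in a MedQA-style file; the 95/518 split mirrors the counts reported above but is shown here as a simple random split for illustration.

```python
import json
import random

def to_free_response(record: dict) -> dict:
    """Drop the multiple-choice options so the model must give an open-ended diagnosis."""
    return {"question": record["question"], "answer": record["answer"]}

def load_diagnosis_questions(path: str) -> list[dict]:
    """Load MedQA-style records and keep only Step 2/Step 3 diagnosis questions.
    The 'step' field name is an assumption about the dataset layout."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    diagnostic = [r for r in records if r.get("step") in (2, 3)]
    return [to_free_response(r) for r in diagnostic]

# Split into a small prompt-engineering (training) set and a held-out test set,
# mirroring the 95 training and 518 test questions described in the study.
questions = load_diagnosis_questions("medqa_usmle.jsonl")
random.seed(0)
random.shuffle(questions)
train_set, test_set = questions[:95], questions[95:95 + 518]
```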
GPT-4 performance was also evaluated on 310 cases from the NEJM case series, excluding cases without a definitive final diagnosis or exceeding GPT-4's maximum context length. Conventional CoT prompting was compared against the clinical diagnostic reasoning CoT prompt that performed best on the MedQA dataset (reasoning toward a differential diagnosis).
Each prompt included two example questions with rationales that demonstrated the target reasoning technique, a few-shot learning approach. The study's evaluation used free-response questions from the USMLE and the NEJM case report series, allowing a rigorous comparison of the different prompting strategies.
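As a rough illustration of how such a few-shot prompt could be assembled, the sketch below combines an instruction, two placeholder exemplars with rationales, and a target question; the exemplar text is hypothetical, not the study's actual prompt content.

```python
def build_few_shot_prompt(instruction: str,
                          exemplars: list[tuple[str, str]],
                          target_question: str) -> str:
    """Assemble an instruction, worked examples with rationales, and the target question."""
    parts = [instruction, ""]
    for question, rationale in exemplars:
        parts += [f"Question: {question}", f"Rationale and diagnosis: {rationale}", ""]
    parts += [f"Question: {target_question}", "Rationale and diagnosis:"]
    return "\n".join(parts)

# Placeholder exemplars; the study used USMLE-style cases with hand-written rationales.
exemplars = [
    ("A 24-year-old woman presents with ...", "The key findings are ... Final diagnosis: ..."),
    ("A 67-year-old man presents with ...", "The key findings are ... Final diagnosis: ..."),
]
prompt = build_few_shot_prompt(
    "Form a differential diagnosis, then reason step by step to the most likely diagnosis.",
    exemplars,
    "A 51-year-old man presents with ...",
)
```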
Physician authors, attending physicians, and an internal medicine resident evaluated the responses generated by the language models. Each question was independently assessed by two blinded physicians, and any disagreements were resolved by a third researcher. To ensure accuracy, physicians used software to verify the correctness of answers when necessary.
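The grading workflow could be expressed roughly as follows, with two blinded graders per question and a third reviewer breaking ties; the function and the agreement calculation are illustrative assumptions rather than the study's actual tooling.

```python
def adjudicate(grades_a: list[bool], grades_b: list[bool],
               tie_breaker: list[bool]) -> tuple[list[bool], float]:
    """Resolve two blinded physicians' correctness grades for each model answer.
    For simplicity the third reviewer's grade is supplied for every question,
    but it is only consulted when the first two graders disagree.
    Returns the final grades and the raw inter-rater agreement of the two graders."""
    final, agreements = [], 0
    for a, b, c in zip(grades_a, grades_b, tie_breaker):
        if a == b:
            agreements += 1
            final.append(a)
        else:
            final.append(c)  # disagreement resolved by the third researcher
    return final, agreements / len(grades_a)

# Toy example: the study reported 97-99% inter-rater agreement on the MedQA evaluations.
final_grades, agreement = adjudicate([True, True, False], [True, False, False], [True, True, True])
print(f"accuracy={sum(final_grades)/len(final_grades):.0%}, agreement={agreement:.0%}")
```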
Results: The study indicates that diagnostic reasoning prompts can lead GPT-4 to emulate the clinical reasoning of clinicians without compromising diagnostic accuracy. This finding is important for judging whether LLM responses can be trusted for patient care, and the approach addresses the black-box limitations of LLMs, moving them closer to safe and effective use in medicine.
GPT-3.5 answered 46% of the evaluation questions correctly with standard CoT prompting and 31% with zero-shot non-CoT prompting. Among the clinical diagnostic reasoning prompts, GPT-3.5 performed best with intuitive-type reasoning (48% versus 46%).
Compared with classic CoT, GPT-3.5 performed significantly worse with analytical reasoning prompts (40%) and differential diagnosis prompts (38%), while Bayesian inference prompts showed a non-significant decline (42%). The inter-rater agreement for GPT-3.5 evaluations on the MedQA data was 97%.
The GPT-4 API returned errors for 20 test questions, reducing the test set to 498 questions. GPT-4 was nonetheless more accurate than GPT-3.5, achieving 76%, 77%, 78%, 78%, and 72% accuracy with classic CoT, intuitive-type reasoning, differential diagnostic reasoning, analytical reasoning, and Bayesian inference prompts, respectively. The inter-rater agreement for GPT-4 evaluations on the MedQA data was 99%.
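As a hedged sketch of how such a run might tolerate API failures, the snippet below queries the model and simply drops questions whose calls error out, which is how a test set could shrink from 518 to 498; the openai client usage, model name, and parameters are assumptions, not the authors' code.

```python
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_questions(prompts: list[str], model: str = "gpt-4") -> tuple[list[str], int]:
    """Query the model for each prompt, skipping questions whose API call fails."""
    responses, errors = [], 0
    for prompt in prompts:
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            responses.append(completion.choices[0].message.content)
        except Exception:
            errors += 1  # e.g. the 20 questions dropped from the 518-question test set
    return responses, errors
```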
On the NEJM dataset, GPT-4 achieved 38% accuracy with conventional CoT versus 34% with the differential diagnosis prompt (a 4.2% difference). The inter-rater agreement for the GPT-4 NEJM assessments was 97%. A review of GPT-4 responses and rationales across the full NEJM dataset showed that prompts encouraging step-by-step reasoning and focusing on a single diagnostic reasoning strategy performed better than those combining multiple strategies.
In summary, the findings indicate that while the diagnostic reasoning prompts elicited clinician-like reasoning from GPT-3.5 and GPT-4, they did not significantly improve accuracy. GPT-4 performed similarly with conventional and intuitive-type reasoning CoT prompts, while the Bayesian inference prompt underperformed on MedQA and the differential diagnosis prompt underperformed conventional CoT on the NEJM cases. The authors suggest three possible explanations for these differences: GPT-4 may rely on reasoning mechanisms distinct from those of clinicians, it may reach a diagnosis first and then present it post hoc in the requested reasoning format, or it may already be extracting the maximum achievable accuracy from the vignette data.
- Savage T, Nayak A, Gallo R, et al. (2024). Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. npj Digit. Med. doi: 10.1038/s41746-024-01010-1. https://www.nature.com/articles/s41746-024-01010-1