
AI Takes a Leap in Healthcare: GPT-3.5 and GPT-4 Demonstrate Proficiency in Clinical Reasoning

January 29, 2024

In a recent publication in npj Digital Medicine, researchers created diagnostic reasoning prompts to assess the ability of large language models (LLMs) to simulate diagnostic clinical reasoning.

Large Language Models (LLMs), artificial intelligence-based systems trained on extensive text data, have demonstrated human-like proficiency in tasks such as generating clinical notes and passing medical exams. Despite these capabilities, understanding their diagnostic reasoning is vital for their effective integration into clinical care.

Recent research has focused on assessing LLMs in handling open-ended clinical questions, highlighting the potential of advanced models like GPT-4 to discern complex patient cases. The ongoing development of prompt engineering is addressing the variability in LLM performance, considering the influence of different prompts and question types.

In this study, researchers conducted an assessment of diagnostic reasoning performed by GPT-3.5 and GPT-4 in response to open-ended clinical questions. The hypothesis was that GPT models could surpass conventional chain-of-thought (CoT) prompting when equipped with diagnostic reasoning prompts.

The study utilized the revised MedQA United States Medical Licensing Exam (USMLE) dataset and the New England Journal of Medicine (NEJM) case series. The comparison evaluated conventional CoT prompting against several diagnostic reasoning prompts inspired by cognitive processes such as forming a differential diagnosis, analytical reasoning, Bayesian inference, and intuitive reasoning.
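For illustration, the sketch below shows how such strategy-specific instructions might be phrased and attached to a free-response vignette. The wording is hypothetical and does not reproduce the authors' actual prompts.

```python
# Hypothetical prompt instructions for each reasoning strategy compared in the
# study; the exact wording used by the authors is not reproduced here.
PROMPT_INSTRUCTIONS = {
    "conventional_cot": "Let's think step by step to reach the most likely diagnosis.",
    "differential_diagnosis": (
        "First list a differential diagnosis for this patient, then reason "
        "through each possibility before committing to a final diagnosis."
    ),
    "analytical": (
        "Relate each symptom, sign, and laboratory finding to the underlying "
        "pathophysiology before selecting a diagnosis."
    ),
    "bayesian": (
        "Start from a prior probability for each candidate diagnosis and update "
        "it as you weigh each new piece of clinical evidence."
    ),
    "intuitive": (
        "State the diagnosis that pattern recognition suggests first, then "
        "briefly justify it."
    ),
}

def build_prompt(strategy: str, vignette: str) -> str:
    """Attach the chosen reasoning instruction to a free-response clinical vignette."""
    return f"{vignette}\n\n{PROMPT_INSTRUCTIONS[strategy]}"
```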

The primary objective was to investigate whether large-language models, guided by specialized prompts, could emulate clinical reasoning skills by combining clinical expertise with advanced prompting techniques.

Prompt engineering played a crucial role in generating the diagnostic reasoning prompts, with questions transformed into free-response format by removing the multiple-choice options. Only Step 2 and Step 3 questions from the USMLE dataset that focus on patient diagnosis were included. Each round of prompt engineering was evaluated for GPT-3.5 accuracy on the MedQA training set, with distinct sets for training (95 questions) and testing (518 questions).
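A minimal sketch of this preprocessing, assuming a simple dictionary layout for MedQA items (the field names and split logic are illustrative, not taken from the study's code):

```python
import random

def to_free_response(item: dict) -> dict:
    """Drop the multiple-choice options so the model must generate a diagnosis itself."""
    return {
        "question": item["question"],        # clinical vignette plus question stem
        "reference_answer": item["answer"],  # kept only for grading, never shown to the model
    }

def split_items(items: list, n_train: int = 95, seed: int = 0):
    """Split free-response items into a 95-question training set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```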

GPT-4 performance was also evaluated using 310 cases from the NEJM journal, excluding those without definitive final diagnoses or exceeding GPT-4’s maximum context length. A comparison was made between conventional CoT prompting and the best-performing clinical diagnostic reasoning CoT prompts (specifically, reasoning for differential diagnosis) on the MedQA dataset.
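The case-filtering step might look roughly like the sketch below, which uses the tiktoken tokenizer to approximate prompt length; the field names and the token budget are assumptions for illustration, not details from the study.

```python
import tiktoken

ENC = tiktoken.encoding_for_model("gpt-4")
MAX_PROMPT_TOKENS = 7000  # assumed budget, leaving headroom for the model's answer

def keep_case(case: dict) -> bool:
    """Keep only cases with a definitive final diagnosis that fit within the context window."""
    if not case.get("final_diagnosis"):
        return False
    return len(ENC.encode(case["text"])) <= MAX_PROMPT_TOKENS

# Hypothetical case records; real NEJM cases would be far longer.
nejm_cases = [
    {"text": "A 54-year-old man presented with fever and night sweats ...", "final_diagnosis": "Sarcoidosis"},
    {"text": "A 33-year-old woman presented with fatigue ...", "final_diagnosis": None},
]
filtered_cases = [c for c in nejm_cases if keep_case(c)]
```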

Each prompt comprised two illustrative questions with rationales employing the target reasoning technique (i.e., few-shot learning). The study’s evaluation involved free-response questions from the USMLE and NEJM case report series, facilitating a rigorous comparison of the different prompting strategies.
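Assembling such a two-exemplar (few-shot) prompt could look like the following sketch; the exemplar text is placeholder rather than the study's actual material.

```python
def assemble_few_shot_prompt(exemplars: list, test_question: str) -> str:
    """exemplars: (question, rationale ending in a final diagnosis) pairs."""
    parts = [f"Question:\n{q}\n\nAnswer:\n{r}" for q, r in exemplars]
    parts.append(f"Question:\n{test_question}\n\nAnswer:")
    return "\n\n---\n\n".join(parts)

exemplars = [
    ("Example vignette 1 ...", "Differential diagnosis: ... Final diagnosis: ..."),
    ("Example vignette 2 ...", "Differential diagnosis: ... Final diagnosis: ..."),
]
prompt = assemble_few_shot_prompt(exemplars, "New vignette to be diagnosed ...")
```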

Physician authors, including attending physicians and an internal medicine resident, evaluated the responses generated by the language models. Each question was independently assessed by two blinded physicians, and any disagreements were resolved by a third researcher. To ensure accuracy, physicians used software to verify the correctness of answers when necessary.
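The grading workflow reduces to a simple adjudication rule, sketched below with boolean correct/incorrect grades; this illustrates the logic rather than the authors' actual tooling.

```python
def final_grade(rater_a: bool, rater_b: bool, adjudicator: bool) -> bool:
    """Agreements between the two blinded raters stand; a third reviewer breaks ties."""
    return rater_a if rater_a == rater_b else adjudicator

def inter_rater_agreement(grades_a: list, grades_b: list) -> float:
    """Simple percent agreement between the two blinded raters (cf. the consensus figures in the results)."""
    matches = sum(a == b for a, b in zip(grades_a, grades_b))
    return matches / len(grades_a)
```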

Results: The study indicates that diagnostic reasoning prompts can elicit clinician-like reasoning from GPT-4 without compromising diagnostic accuracy. This finding is important for assessing whether LLM responses can be trusted for patient care, and the approach helps address the black-box limitations of LLMs, moving them closer to safe and effective use in medicine.

GPT-3.5 demonstrated accurate responses in 46% of evaluation questions with standard chain-of-thought (CoT) prompting and 31% with zero-shot-type non-CoT prompting. Among prompts related to clinical diagnostic reasoning, GPT-3.5 performed best with intuitive-type reasoning (48% versus 46%).

In comparison to classic chain-of-thought, GPT-3.5 exhibited significantly lower performance with analytical reasoning prompts (40%) and those for developing differential diagnoses (38%), while Bayesian inferences showed a non-significant decline (42%). The inter-rater consensus for GPT-3.5 evaluations using the MedQA data was 97%.

The GPT-4 API encountered errors on 20 test questions, reducing the test dataset to 498 questions. GPT-4 nonetheless displayed improved accuracy compared to GPT-3.5, achieving accuracies of 76%, 77%, 78%, 78%, and 72% with classical chain-of-thought, intuitive-type reasoning, differential diagnostic reasoning, analytical reasoning, and Bayesian inference prompts, respectively. The inter-rater consensus for GPT-4 evaluations on the MedQA data was 99%.

For the NEJM dataset, GPT-4 demonstrated a 38% accuracy with conventional chain-of-thought versus 34% for prompts involving the formulation of differential diagnoses (a 4.2% difference). The inter-rater consensus for GPT-4 NEJM assessments was 97%. GPT-4 responses and rationales for the entire NEJM dataset showed that prompts encouraging step-by-step reasoning and focusing on a single diagnostic reasoning strategy yielded better performance compared to those combining multiple strategies.
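For illustration, one way to gauge whether a gap of this size is statistically meaningful is a two-proportion z-test on the 310-case figures above; this treats the two prompt arms as independent samples and is not necessarily the test the authors used.

```python
from statsmodels.stats.proportion import proportions_ztest

n_cases = 310
# Approximate correct counts implied by 38% vs. 34% accuracy on 310 cases.
correct = [round(0.38 * n_cases), round(0.34 * n_cases)]
z_stat, p_value = proportions_ztest(correct, [n_cases, n_cases])
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```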

In summary, the findings indicate that while diagnostic reasoning prompts elicit clinician-like rationales from GPT-3.5 and GPT-4, they do not significantly improve accuracy. GPT-4 performed similarly with conventional and intuitive-type reasoning chain-of-thought prompts but showed lower performance with analytical and differential diagnosis prompts; Bayesian inference prompting also declined relative to classical CoT. The authors suggest three possible explanations for these differences: distinct reasoning mechanisms in GPT-4, post-hoc diagnostic evaluations written in the requested reasoning format, or the model already reaching the maximum achievable accuracy given the vignette data provided.
