Testing the Accuracy of Modern LLMs in Answering General Medical Prompts

Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model

Background: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT for medical queries are not known. Methods: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard, with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these questions for accuracy (6-point Likert scale; range 1 – completely incorrect to 6 – completely correct) and completeness (3-point Likert scale; range 1 – incomplete to 3 – complete plus additional context). Scores were summarized with descriptive statistics and compared using Mann-Whitney U or Kruskal-Wallis testing. Results: Across all questions (n=284), the median accuracy score was 5.5 (between almost completely and completely correct), with a mean score of 4.8 (between mostly and almost completely correct) ...
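
The statistical comparison named in this abstract is standard enough to sketch. Below is a minimal Python example (using SciPy) of the two nonparametric tests mentioned, run over hypothetical Likert scores for illustration; these are not the study's actual data.

```python
# Minimal sketch: comparing physician-assigned Likert accuracy scores
# across difficulty groups, in the spirit of the study's analysis.
# All score values below are hypothetical, for illustration only.
from scipy.stats import mannwhitneyu, kruskal

easy   = [6, 5, 6, 4, 5, 6, 5]   # hypothetical accuracy scores (1-6 scale)
medium = [5, 4, 6, 3, 5, 4, 5]
hard   = [4, 3, 5, 2, 4, 3, 4]

# Two-group comparison (e.g., easy vs. hard questions): Mann-Whitney U
u_stat, u_p = mannwhitneyu(easy, hard, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.3f}")

# Three-group comparison (easy vs. medium vs. hard): Kruskal-Wallis H
h_stat, h_p = kruskal(easy, medium, hard)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {h_p:.3f}")
```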

Performance of ChatGPT on USMLE: Unlocking the Potential of Large Language Models for AI-Assisted Medical Education

arXiv (Cornell University), 2023

Artificial intelligence is gaining traction in more ways than ever before. The popularity of language models and AI-based businesses has soared since ChatGPT was made available to the general public via OpenAI. It is becoming increasingly common for people to use ChatGPT both professionally and personally. Given the widespread use of ChatGPT and the reliance people place on it, this study determined how reliable ChatGPT can be for answering complex medical and clinical questions. Questions from Harvard University's gross anatomy course and the United States Medical Licensing Examination (USMLE) were used to accomplish this objective. The paper evaluated the obtained results using a 2-way ANOVA and post hoc analysis, both of which showed systematic covariation between question format and prompt. Furthermore, physician adjudicators independently rated the outputs for accuracy, concordance, and insight. The analysis found ChatGPT-generated answers to be more context-oriented and a better model for deductive reasoning than regular Google search results. ChatGPT obtained 58.8% on logical questions and 60% on ethical questions, meaning that it is approaching the passing range for logical questions and has crossed the threshold for ethical questions. The authors believe ChatGPT and other large language models can be invaluable tools for e-learners; however, the study suggests that there is still room to improve their accuracy. Further research into how ChatGPT answers different types of questions is needed to improve its performance in the future.
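
For readers unfamiliar with the design, the following sketch shows a 2-way ANOVA with post hoc testing of the kind named above, using statsmodels. The scores and the factor levels (question format, prompt style) are hypothetical stand-ins, not the paper's data.

```python
# Sketch of a 2-way ANOVA over answer format and prompt style,
# analogous to the analysis described above. Data are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.DataFrame({
    "score":  [58, 62, 55, 60, 64, 59, 61, 57, 63, 56, 65, 58],
    "format": ["mcq", "mcq", "mcq", "mcq", "open", "open",
               "open", "open", "mcq", "open", "mcq", "open"],
    "prompt": ["direct", "cot", "direct", "cot", "direct", "cot",
               "direct", "cot", "cot", "direct", "direct", "cot"],
})

# Two-way ANOVA with an interaction term between the two factors
model = ols("score ~ C(format) * C(prompt)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Post hoc pairwise comparison on one factor (Tukey HSD)
print(pairwise_tukeyhsd(df["score"], df["format"]))
```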

Towards Expert-Level Medical Question Answering with Large Language Models

arXiv (Cornell University), 2023

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
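
The ensemble refinement strategy mentioned above can be sketched compactly: sample several reasoning chains at nonzero temperature, then condition a second, greedy pass on all of them. The `generate` callable and prompt templates below are hypothetical placeholders illustrating the idea, not the paper's actual method.

```python
# Hedged sketch of ensemble refinement: sample diverse reasoning
# chains, then refine by conditioning on all drafts at once.
# `generate(prompt, temperature)` is a hypothetical stand-in for
# any LLM completion call; prompts are illustrative only.
from typing import Callable, List

def ensemble_refine(question: str,
                    generate: Callable[[str, float], str],
                    n_samples: int = 5) -> str:
    # Stage 1: stochastically sample diverse reasoning chains + answers
    drafts: List[str] = [
        generate(f"Question: {question}\nThink step by step, then answer.",
                 1.0)                      # temperature > 0 for diversity
        for _ in range(n_samples)
    ]
    # Stage 2: refine, conditioning on the question plus all drafts
    context = "\n\n".join(f"Draft {i+1}: {d}" for i, d in enumerate(drafts))
    return generate(
        f"Question: {question}\n\n{context}\n\n"
        "Considering the drafts above, give a single refined answer.",
        0.0,                               # greedy decoding for the final pass
    )
```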

Large Language Models Encode Clinical Knowledge

arXiv (Cornell University), 2022

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
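
Instruction prompt tuning rests on a simple mechanism: a small set of learnable "soft prompt" embeddings is prepended to the input embeddings while the base model's weights stay frozen. The PyTorch sketch below illustrates that mechanism under assumed dimensions; it is not the paper's implementation.

```python
# Minimal sketch of the soft-prompt idea behind instruction prompt
# tuning: only the prepended prompt vectors are trained; the base
# LLM stays frozen. Dimensions here are assumptions for illustration.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_tokens: int = 20, d_model: int = 512):
        super().__init__()
        # The only trainable parameters in this parameter-efficient setup
        self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model) token embeddings
        batch = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

# Usage: prepend learned prompt embeddings before the frozen model runs
soft = SoftPrompt()
x = torch.randn(2, 10, 512)          # hypothetical input embeddings
print(soft(x).shape)                 # torch.Size([2, 30, 512])
```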

A large language model for electronic health records

npj Digital Medicine, 2022

There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, and the largest of those trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model, GatorTron, using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on five clinical NLP tasks: clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data could benefit these NLP tasks. GatorTron models scale up the clinical language model from 110 million to 8.9 billion parameters and improve performance on all five clinical NLP tasks (e.g., 9.6% and 9.5% accuracy improvements on NLI and MQA, respectively), which can be applied to medical AI systems to improve healthcare delivery.

CLEAR: Pilot Testing of a Tool to Standardize Assessment of the Quality of Health Information Generated by Artificial Intelligence-Based Models

Artificial intelligence (AI)-based conversational models, such as ChatGPT, Microsoft Bing, and Google Bard, have emerged as valuable sources of health information for lay individuals. However, the accuracy of information provided by these AI models remains a significant concern. This pilot study aimed to test a new tool, referred to as "CLEAR", designed to assess the quality of health information delivered by AI-based models. Tool development involved a literature review on health information quality, followed by the initial drafting of the CLEAR tool, comprising five items that assess the following: completeness of content in response to the prompt, lack of false information, evidence support, appropriateness, and relevance of the generated content. Each item was scored on a 5-point Likert scale from excellent to poor. Content validity was checked by expert review of the initial items. Pilot testing involved 32 healthcare professionals using the CLEAR tool to assess content ...
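
To make the five-item structure concrete, here is an illustrative sketch of tabulating CLEAR-style ratings. The item keys mirror the abstract; the numeric ratings and the averaging into a single summary score are assumptions for demonstration, not part of the published tool.

```python
# Illustrative sketch of recording and summarizing CLEAR-style item
# scores from one rater. Ratings and the mean-score summary are
# hypothetical assumptions, not the validated scoring procedure.
from statistics import mean

CLEAR_ITEMS = [
    "completeness_of_content",
    "lack_of_false_information",
    "evidence_support",
    "appropriateness",
    "relevance",
]

# Hypothetical 5-point ratings (5 = excellent, 1 = poor)
ratings = {"completeness_of_content": 4,
           "lack_of_false_information": 5,
           "evidence_support": 3,
           "appropriateness": 4,
           "relevance": 5}

assert set(ratings) == set(CLEAR_ITEMS)
print(f"Mean CLEAR score: {mean(ratings.values()):.2f} / 5")
```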

Generative Large Language Models are autonomous practitioners of evidence-based medicine

arXiv (Cornell University), 2024

Background Generative Large Language Models (LLMs) have emerged as versatile tools in healthcare, demonstrating the ability to regurgitate clinical knowledge and pass medical licensing exams. Despite their promise, they have been largely treated as slow, imperfect information retrieval tools and face limitations such as data staleness, resource intensity, and manufacturing incorrect text, reducing their applicability to dynamic healthcare settings. Methods This study explored the functionality of both proprietary and open-source LLMs to act as autonomous agents within a simulated tertiary care medical center. Real-world clinical cases across multiple specialties were structured into JSON files and presented to agents for solution using the resources available to a human physician. Agents were created using LLMs in combination with natural language prompts, tools with real-world interactions, and standard programming techniques. The technique of Retrieval Augmented Generation was used to provide agents with updated context whenever appropriate. Expert clinicians collected and evaluated model responses across several performance metrics, including correctness of the final answer, judicious use of tools, guideline conformity, and resistance to hallucinations. Findings Agents showed varied performance across specialties, with proprietary models (e.g., GPT-4) generally outperforming open-source models. The use of Retrieval Augmented Generation (RAG) improved guideline adherence and contextually relevant responses for the best-performing model. Interpretation LLMs can effectively function as autonomous agents in healthcare by leveraging their generative capabilities and integrating with real-world data. The study highlights the potential of LLMs to enhance decision-making in clinical settings through tailored prompts and retrieval tools. However, the variability in model performance and the necessity for ongoing manual evaluation suggest that further refinements in LLM technology and operational protocols are needed to optimize their utility in healthcare.
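
The retrieval-augmented step these agents rely on can be sketched compactly: rank reference snippets by similarity to the case, then prepend the top hits to the prompt. The `embed` and `generate` callables below are hypothetical stand-ins for an embedding model and an LLM call; this is a generic RAG sketch, not the study's agent framework.

```python
# Hedged sketch of a Retrieval Augmented Generation step: fetch the
# guideline snippets most similar to a clinical case, then prepend
# them to the prompt. `embed` and `generate` are hypothetical.
from typing import Callable, List
import numpy as np

def rag_answer(case: str,
               guidelines: List[str],
               embed: Callable[[str], np.ndarray],
               generate: Callable[[str], str],
               k: int = 3) -> str:
    # Rank guideline snippets by cosine similarity to the case text
    q = embed(case)
    scores = [
        float(np.dot(q, embed(g)) /
              (np.linalg.norm(q) * np.linalg.norm(embed(g)) + 1e-9))
        for g in guidelines
    ]
    top = [guidelines[i] for i in np.argsort(scores)[::-1][:k]]
    context = "\n".join(f"- {g}" for g in top)
    # Augment the prompt with the retrieved, up-to-date context
    return generate(
        f"Clinical case:\n{case}\n\nRelevant guidelines:\n{context}\n\n"
        "Recommend next steps, citing the guidelines above."
    )
```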

Role of large language models in improving provider-patient experience and interaction efficiency: A scoping review

2024

Large language models (LLMs) have rapidly emerged as transformative tools across multiple domains, including healthcare. The ability of LLMs to process vast amounts of data and generate human-like responses has facilitated their integration into patient care, particularly in enhancing communication, improving patient satisfaction, and streamlining administrative processes. Despite this potential, there are concerns regarding their accuracy, reliability, and ethical use in clinical settings. This scoping review aims to investigate and map the current literature on the use of LLMs in improving provider-patient experience and interaction efficiency. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews guidelines, we conducted a systematic search of Ovid MEDLINE, PubMed, and Google Scholar databases to identify relevant articles published between January 2015 and June 2024. Of the 3568 articles initially screened, 47 satisfied the inclusion criteria. These articles spanned 13 countries and encompassed diverse healthcare settings. Thematic areas of LLM utilization included improving communication between patients and healthcare providers, resolving patient inquiries, enhancing patient education, and increasing operational efficiency. Although numerous studies have yielded positive outcomes, significant challenges related to data accuracy, hallucinations, bias, and ethical concerns remain. LLMs can considerably improve patient experience in healthcare, particularly in areas of communication, education, and administrative efficiency. However, concerns regarding accuracy, ethical implications, and the need for rigorous safeguards to prevent misinformation impede their widespread adoption. Future research should focus on developing context-specific LLMs tailored to healthcare environments and addressing the identified limitations to optimize their implementation in clinical practice.

Assessing the Responses of Large Language Models (ChatGPT-4, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Breast Imaging: A Study on Readability and Accuracy

Background Large language models (LLMs), such as ChatGPT-4, Gemini, and Microsoft Copilot, have been instrumental in various domains, including healthcare, where they enhance health literacy and aid in patient decision-making. Given the complexities involved in breast imaging procedures, accurate and comprehensible information is vital for patient engagement and compliance. This study aims to evaluate the readability and accuracy of the information provided by three prominent LLMs, ChatGPT-4, Gemini, and Microsoft Copilot, in response to frequently asked questions in breast imaging, assessing their potential to improve patient understanding and facilitate healthcare communication. Methodology We collected the most common questions on breast imaging from clinical practice and posed them to LLMs. We then evaluated the responses in terms of readability and accuracy. Responses from LLMs were analyzed for readability using the Flesch Reading Ease and Flesch-Kincaid Grade Level tests and for accuracy through a radiologist-developed Likert-type scale. Results The study found significant variations among LLMs. Gemini and Microsoft Copilot scored higher on readability scales (p < 0.001), indicating their responses were easier to understand. In contrast, ChatGPT-4 demonstrated greater accuracy in its responses (p < 0.001). Conclusions While LLMs such as ChatGPT-4 show promise in providing accurate responses, readability issues may limit their utility in patient education. Conversely, Gemini and Microsoft Copilot, despite being less accurate, are more accessible to a broader patient audience. Ongoing adjustments and evaluations of these models are essential to ensure they meet the diverse needs of patients, emphasizing the need for continuous improvement and oversight in the deployment of artificial intelligence technologies in healthcare.
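
Both readability metrics named above are closed-form formulas over word, sentence, and syllable counts, so a worked sketch is straightforward. The crude vowel-group syllable counter below is an assumption for illustration only; production readability tools use dictionaries or more careful heuristics.

```python
# Worked sketch of the Flesch Reading Ease and Flesch-Kincaid Grade
# Level formulas. The syllable heuristic is a rough approximation.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    w, s = len(words), sentences
    # Flesch Reading Ease: higher = easier to read
    flesch_ease = 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w)
    # Flesch-Kincaid Grade Level: approximate US school grade
    fk_grade = 0.39 * (w / s) + 11.8 * (syllables / w) - 15.59
    return flesch_ease, fk_grade

ease, grade = readability("Mammograms use low-dose X-rays to image the "
                          "breast. They help find cancer early.")
print(f"Flesch Reading Ease: {ease:.1f}, FK Grade Level: {grade:.1f}")
```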