Towards Expert-Level Medical Question Answering with Large Language Models

Entity-Enriched Neural Models for Clinical Question Answering

Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, 2020

We explore state-of-the-art neural models for question answering on electronic medical records and improve their ability to generalize to previously unseen (paraphrased) questions at test time. We enable this by learning to predict logical forms as an auxiliary task alongside the main task of answer span detection. The predicted logical forms also serve as a rationale for the answer. Further, we incorporate medical entity information into these models via the ERNIE (Zhang et al., 2019a) architecture. We train our models on the large-scale emrQA dataset and observe that our multi-task entity-enriched models generalize to paraphrased questions ~5% better than the baseline BERT model.
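The multi-task setup described above can be sketched as a weighted sum of a span-detection loss and an auxiliary logical-form classification loss. This is a minimal illustration, assuming cross-entropy losses and a single scalar weight `aux_weight`; the head shapes and weighting are assumptions, not the paper's exact configuration:

```python
import numpy as np

def cross_entropy(logits, target):
    # Softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def multi_task_loss(start_logits, end_logits, lf_logits,
                    start_idx, end_idx, lf_idx, aux_weight=0.5):
    # Main task: answer span detection (start and end token positions).
    span_loss = cross_entropy(start_logits, start_idx) + cross_entropy(end_logits, end_idx)
    # Auxiliary task: predict the logical form of the question,
    # which doubles as a rationale for the extracted answer.
    lf_loss = cross_entropy(lf_logits, lf_idx)
    return span_loss + aux_weight * lf_loss
```

Minimizing this joint objective pushes the shared encoder to represent the question's logical structure, which is what the paper credits for the gain on paraphrased questions.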

Large Language Models Encode Clinical Knowledge

arXiv (Cornell University), 2022

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Testing the Accuracy of Modern LLMs in Answering General Medical Prompts

International Journal of Social Science & Economic Research

The rising use of large language models (LLMs) for answering medical questions necessitates an evaluation of their accuracy, especially given the implications for public health. This study employed a comprehensive test suite of 500 medical prompts, evaluated by a panel of medical experts for factual accuracy, contextual relevance, and potential risk. The responses from state-of-the-art LLMs were also compared with answers from a control group of medical students. Results indicated a high level of accuracy among LLMs, with a median score of 88%. While LLMs performed well on general wellness questions (92% accuracy), they were less reliable for specialized medical queries (80% accuracy). The control group of medical students outperformed LLMs in answering specialized medical questions. In conclusion, while LLMs demonstrate a high degree of factual accuracy for general medical information, they are less reliable for specialized or complex health-related queries. Given their widespread use, LLMs could serve as a preliminary source for general medical advice, but their limitations underscore the need for consulting experts for specialized medical conditions. Future work should focus on enhancing the models' capabilities in specialized domains and evaluating the ethical implications of using LLMs for medical information dissemination. This study serves as a baseline for the responsible use of AI in healthcare.

Medical Exam Question Answering with Large-scale Reading Comprehension

2018

Reading and understanding text is an important component of computer-aided diagnosis in clinical medicine, and is also a major research problem in NLP. In this work, we introduce a question-answering task called MedQA to study answering questions in clinical medicine using knowledge in a large-scale document collection. The aim of MedQA is to answer real-world questions with large-scale reading comprehension. We propose our solution SeaReader, a modular end-to-end reading comprehension model based on LSTM networks and a dual-path attention architecture. The novel dual-path attention models information flow from two perspectives and can simultaneously read individual documents and integrate information across multiple documents. In experiments, SeaReader achieved a large increase in accuracy on MedQA over competing models. Additionally, we develop a series of novel techniques to interpret the question-answering process in SeaReader.

Large Biomedical Question Answering Models with ALBERT and ELECTRA

2021

The majority of systems that participated in the BioASQ8 challenge are based on the BioBERT model [1]. We adopt a different approach in our participation in the BioASQ9B challenge by taking advantage of large biomedical language models built on the ELECTRA [2] and ALBERT [3] architectures, including both BioM-ELECTRA and BioM-ALBERT [4]. Moreover, we examine the advantage of transferability [5] between BioASQ and other text classification tasks such as Multi-Genre Natural Language Inference (MultiNLI) [6]. Our results show that both BioM-ELECTRA and BioM-ALBERT significantly outperform the BioBERT model on the BioASQ9B task.

An architecture for complex clinical question answering

2010

We present the software architecture for a coming community resource, the Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ). This system is designed to capitalize on state-of-the-art semantic annotation of text to answer complex clinical practice questions and to enable clinical investigators to perform pioneering data mining tasks. The architecture allows easy customization to facilitate integration with different electronic medical records systems and data sources, to retrain machine learning (ML) classifiers to handle domain-specific details, to utilize new annotators and ML algorithms as they become available, and to enhance, replace or add new core system components.

Learning to Ask Like a Physician

Proceedings of the 4th Clinical Natural Language Processing Workshop

Existing question answering (QA) datasets derived from electronic health records (EHR) are artificially generated and consequently fail to capture realistic physician information needs. We present Discharge Summary Clinical Questions (DiSCQ), a newly curated question dataset composed of 2,000+ questions paired with the snippets of text (triggers) that prompted each question. The questions are generated by medical experts from 100+ MIMIC-III discharge summaries. We analyze this dataset to characterize the types of information sought by medical experts. We also train baseline models for trigger detection and question generation (QG), paired with unsupervised answer retrieval over EHRs. Our baseline model is able to generate high-quality questions in over 62% of cases when prompted with human-selected triggers. We release this dataset (and all code to reproduce baseline model results) to facilitate further research into realistic clinical QA and QG.

Pre-trained Language Model for Biomedical Question Answering

Machine Learning and Knowledge Discovery in Databases, 2020

The recent success of question answering systems is largely attributed to pre-trained language models. However, as language models are mostly pre-trained on general domain corpora such as Wikipedia, they often have difficulty in understanding biomedical questions. In this paper, we investigate the performance of BioBERT, a pre-trained biomedical language model, in answering biomedical questions including factoid, list, and yes/no type questions. BioBERT uses almost the same structure across various question types and achieved the best performance in the 7th BioASQ Challenge (Task 7b, Phase B). BioBERT pre-trained on SQuAD or SQuAD 2.0 easily outperformed previous state-of-the-art models. BioBERT obtains the best performance when it uses appropriate pre-/post-processing strategies for questions, passages, and answers.
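The factoid setting discussed above comes down to selecting an answer span from a passage using start and end position logits. A minimal sketch of that post-processing step follows; the scoring rule (sum of logits) and the answer-length cap are common conventions for SQuAD-style models, not necessarily BioBERT's exact pipeline:

```python
def best_span(start_logits, end_logits, max_len=30):
    # Pick the (start, end) token pair maximizing start_logit + end_logit,
    # subject to start <= end and a maximum answer length in tokens.
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```

For example, `best_span([0.0, 5.0, 1.0, 0.0], [0.0, 0.0, 1.0, 6.0])` selects the span from token 1 to token 3, since that pairing maximizes the combined logits while keeping the start before the end.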

The MiPACQ clinical question answering system

AMIA Annual Symposium Proceedings, 2011

The Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) is a QA pipeline that integrates a variety of information retrieval and natural language processing systems into an extensible question answering system. We present the system's architecture and an evaluation of MiPACQ on a human-annotated evaluation dataset based on the Medpedia health and medical encyclopedia. Compared with our baseline information retrieval system, the MiPACQ rule-based system demonstrates 84% improvement in Precision at One and the MiPACQ machine-learning-based system demonstrates 134% improvement. Other performance metrics including mean reciprocal rank and area under the precision/recall curves also showed significant improvement, validating the effectiveness of the MiPACQ design and implementation.
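The Precision at One and mean reciprocal rank figures reported above can be computed from per-question ranked relevance judgments. A generic sketch (binary relevance lists, not MiPACQ's actual evaluation code):

```python
def precision_at_one(ranked_lists):
    # Fraction of questions whose top-ranked candidate is relevant.
    # Each inner list holds 1/0 relevance labels in ranked order.
    return sum(ranks[0] for ranks in ranked_lists) / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists):
    # Average of 1/rank of the first relevant candidate per question
    # (contributing 0 when no candidate is relevant).
    total = 0.0
    for ranks in ranked_lists:
        for i, rel in enumerate(ranks, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)
```

For three questions whose relevance lists are `[1,0,0]`, `[0,1,0]`, and `[0,0,0]`, Precision at One is 1/3 and MRR is (1 + 1/2 + 0)/3 = 0.5, which illustrates why MRR rewards near-misses that Precision at One counts as failures.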