Topic Modeling for Classification of Clinical Reports

Topic Modeling Based Classification of Clinical Reports

2013

We selected 25 of the 52 submissions for publication (an acceptance rate of 48%); the submissions came from students in a wide variety of countries. Three papers were presented orally in one of the parallel sessions of the main conference. The other 22 papers were shown as posters during the poster session of the main conference.

Multi-topic Aspects in Clinical Text Classification

2007 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2007), 2007

This paper investigates multi-topic aspects in automatic classification of clinical free text. In many practical situations, we need to deal with documents that span multiple topics. Automatic assignment of multiple ICD-9-CM codes to clinical free text in medical records is a typical multi-topic text classification problem. In this paper, we consider two different views of multi-topic documents. The Closed Topic Assumption (CTA) regards the absence of a topic from a document as an explicit declaration that the document does not belong to that topic. In contrast, the Open Topic Assumption (OTA) treats missing topics as neutral. This paper compares the performance of various ways of translating a multi-topic text classification problem into a machine learning problem. Experimental results show that the characteristics of multi-topic assignments in the Medical NLP Challenge data are OTA-oriented.
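As a rough illustration of the two assumptions, the sketch below turns one document's assigned codes into per-topic training targets. The function name `topic_targets`, the placeholder code names, and the encoding (+1 positive, -1 negative, 0 neutral) are assumptions made for this example; the abstract does not say how the authors encode or train on neutral topics.

```python
def topic_targets(doc_codes, all_codes, assumption):
    """Build one target per topic for a single document.

    doc_codes:  set of codes assigned to the document
    all_codes:  every code that appears in the label set
    assumption: "CTA" (absent topic = explicit negative)
                or "OTA" (absent topic = neutral, i.e. unknown)
    """
    if assumption not in ("CTA", "OTA"):
        raise ValueError("assumption must be 'CTA' or 'OTA'")
    # Hypothetical encoding: -1 marks a negative example, 0 marks a neutral one.
    absent_value = -1 if assumption == "CTA" else 0
    return {code: (1 if code in doc_codes else absent_value) for code in all_codes}


# Placeholder code names, not real ICD-9-CM codes.
label_set = ["code_a", "code_b", "code_c"]
document_codes = {"code_a", "code_c"}

cta_targets = topic_targets(document_codes, label_set, "CTA")
ota_targets = topic_targets(document_codes, label_set, "OTA")
```

Under CTA the document becomes a negative training example for `code_b`; under OTA it contributes no evidence either way for that code, so a learner would need some other source of negatives or a method that tolerates unlabeled examples.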

Text Classification for Medical Informatics: A Comparison of Models for Data Mining Radiological Medical Records

Asia Pacific World, 2011

In this study we analyze 1024 free-text digital records from pediatric patients who underwent CT scanning during a one-year period in 2004 at the Nagasaki University Medical Hospital in Japan. We use text mining algorithms to model the records. An expert in the field evaluated each scan and classified it as necessary or unnecessary, and we built a model that predicts this classification. The results show that models developed on raw text could contribute significantly to the physician's decision to order a CT scan. This matters in practice because radiation at the levels ordinarily used for CT scanning may pose significant health risks, especially to children, so modeling unnecessary scanning may lead to less radiation exposure.

Automatic classification of radiological reports for clinical care

Artificial intelligence in medicine, 2018

Radiological reporting generates a large amount of free-text clinical narratives, a potentially valuable source of information for improving clinical care and supporting research. Automatic techniques for analyzing such reports are necessary to make their content effectively available to radiologists in an aggregated form. In this paper we focus on the classification of chest computed tomography reports according to a classification schema proposed for this task by radiologists of the Italian hospital ASST Spedali Civili di Brescia. The proposed system is built from a training data set containing reports annotated by radiologists. Each report is classified according to the schema developed by the radiologists, and textual evidence is marked in the report. The annotations are then used to train different machine-learning-based classifiers. We present a method based on a cascade of classifiers that makes use of a set of syntactic and semantic features. The resu...

Using phrases and document metadata to improve topic modeling of clinical reports

Journal of biomedical informatics, 2016

Probabilistic topic models provide an unsupervised method for analyzing unstructured text and have the potential to be integrated into clinical automatic summarization systems. Clinical documents are accompanied by metadata in a patient's medical history and frequently contain multiword concepts that can be valuable for accurately interpreting the included text. While existing methods have attempted to address these problems individually, we present a unified model for free-text clinical documents that integrates contextual patient- and document-level data and discovers multi-word concepts. In the proposed model, phrases are represented by chained n-grams and a Dirichlet hyper-parameter is weighted by both document-level and patient-level context. This method and three other Latent Dirichlet Allocation models were fit to a large collection of clinical reports. Examples of resulting topics demonstrate the results of the new model and the quality of the representations are ev...

Machine learning analysis of topic modeling re-ranking of clinical records

Smart Biosensors in Medical Care, 2020

Big data technologies have improved the analysis of clinical information, supporting a better understanding of diseases and more efficient diagnoses. Online healthcare systems generate huge amounts of data through record keeping that takes into account applicable requirements and patient care. These clinical records are stored as files, which makes data processing and finding relevant documents a challenge. In this work, we use a method that combines statistical topic models, language models, and natural language processing to retrieve clinical records. For analysing large clinical records in document form, topic models are used to find related clusters of disease patterns. We explore the decomposition of clinical record summaries into topics, which enables effective clustering of relevant documents based on the topic under study. Clinical documents selected with a topic-based approach give users the information they need to better understand the related data and derive insights from it. Our proposed method uses clustering-based semantic similarity topic modelling to summarize clinical reports, based on Latent Dirichlet Allocation (LDA) in a MapReduce framework. Automated unsupervised analysis of the LDA models is used to identify different disease patterns and to rank topic significance. Topic and keyword re-ranking methods help physicians obtain better information from the LDA-derived topics. The experimental assessment confirmed the value of these methods for clinical document summarization.
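To make the LDA step more concrete, here is a minimal single-machine sketch of collapsed Gibbs sampling for LDA, followed by grouping documents by their dominant topic. It is not the authors' MapReduce implementation or their re-ranking method; the function names, the toy token lists, and the hyper-parameter values (`alpha`, `beta`, `iters`) are assumptions chosen only for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized docs with collapsed Gibbs sampling.

    Returns (theta, topic_word): per-document topic proportions and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]            # tokens of doc d in topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                          # tokens assigned to topic k
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


def dominant_topic(theta_row):
    """Index of the topic with the largest proportion in one document."""
    return max(range(len(theta_row)), key=lambda k: theta_row[k])


# Toy, made-up token lists standing in for preprocessed clinical summaries.
toy_docs = [
    ["chest", "pain", "ecg", "chest"],
    ["fracture", "xray", "wrist", "fracture"],
    ["ecg", "pain", "troponin"],
    ["wrist", "xray", "cast"],
]
theta, topic_word = lda_gibbs(toy_docs, num_topics=2)
clusters = [dominant_topic(row) for row in theta]
```

Each entry of `clusters` is the index of the topic that dominates the corresponding document, which is one simple way to group documents by topic. Ranking topics or keywords by significance, as the abstract describes, would need an additional scoring step that is not shown here.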

Automated Outcome Classification of Emergency Department Computed Tomography Imaging Reports

Academic Emergency Medicine, 2013

Background: Reliably abstracting outcomes from free-text electronic health records remains a challenge. While automated classification of free text has been a popular medical informatics topic, performance validation using real-world clinical data has been limited. The two main approaches are linguistic (natural language processing [NLP]) and statistical (machine learning). The authors have developed a hybrid system for abstracting computed tomography (CT) reports for specified outcomes.

Automatic Classification of Critical Findings in Radiology Reports

2017

Communication of "actionable" findings in radiology reports is an important part of high-quality medical care. Distinguishing radiology reports with "actionable" findings from other reports is currently a function of the radiologist and largely a manual process. This paper describes a system that automatically classifies patients' radiology reports by the degree of severity of their "actionable" findings, using reports provided by the radiology department at the University of Massachusetts Medical School. The system applies a machine learning classifier to text-based features. Several machine learning classification algorithms are evaluated and compared. The random forest classifier performed best in this case, while the other classification methods also performed reasonably well.

Semi-Supervised Natural Language Approach for Fine-Grained Classification of Medical Reports

2019

Although machine learning has become a powerful tool to augment doctors in clinical analysis, the immense amount of labeled data needed to train supervised learning approaches makes each development task time and resource intensive. The vast majority of dense clinical information is stored in written reports detailing pertinent patient information. The challenge in using natural language data for standard model development lies in the complex and unstructured nature of the modality. In this research, a model pipeline was developed that uses an unsupervised approach to train an encoder language model, a bidirectional recurrent neural network, to generate document encodings. These encodings are then passed as features into a decoder-classifier model that requires orders of magnitude less labeled data than previous approaches to differentiate accurately between fine-grained disease classes. The language model was trained on unlabeled radiology reports from the Massachusetts General Hospital Radiology Department (n=218,159) and terminated with a loss of 1.62 and a word prediction accuracy of 62%. The classification models were trained on three labeled datasets of head CT studies of patients presenting with large vessel occlusion (n=1403), acute ischemic stroke (n=331), and intracranial hemorrhage (n=4350) to identify a variety of findings directly from the radiology report data, resulting in AUCs of 0.98, 0.95, and 0.99, respectively. The output encodings can be used in conjunction with imaging data to create models that process multiple modalities. The ability to automatically extract relevant features from textual data allows for faster model development and integration of...

Latent Topic Based Medical Data Classification

2011

This paper discusses the classification process for medical data. We use the data from ACM KDDCup 2008 to demonstrate our classification process based on latent topic discovery. In this data set, the target set and the outliers are quite different in nature: the target set makes up only 0.6% of the data, while the outliers make up the remaining 99.4%. We use this data set as an example to show how we dealt with such extremely biased data using latent topic discovery and noise reduction techniques. Our experiment faces two major challenges: (1) extremely distributed outliers, and (2) far fewer positive samples than negative ones. We propose a process flow suited to these issues and obtain a best AUC of 0.98.
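The 0.6% versus 99.4% split suggests why AUC is a more informative measure than accuracy here. The sketch below is a plain pairwise AUC computation on synthetic labels, not the authors' evaluation code: a classifier that always predicts the majority class reaches 99.4% accuracy on such data, yet an uninformative constant score yields an AUC of only 0.5. The function name `auc` and the synthetic data are assumptions for this example.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative, counting ties as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Synthetic data with the same class ratio as described: 0.6% positives.
labels = [1] * 6 + [0] * 994

# Always predicting the majority (negative) class looks accurate...
always_negative_accuracy = labels.count(0) / len(labels)

# ...but a constant score carries no ranking information.
constant_scores = [0.0] * len(labels)
constant_auc = auc(labels, constant_scores)
```

On this synthetic set `always_negative_accuracy` is 0.994 while `constant_auc` is 0.5, which is why an AUC of 0.98 on such skewed data says far more than a high accuracy figure would.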