Machine learning analysis of topic modeling re-ranking of clinical records

Using phrases and document metadata to improve topic modeling of clinical reports

Journal of biomedical informatics, 2016

Probabilistic topic models provide an unsupervised method for analyzing unstructured text and have the potential to be integrated into clinical automatic summarization systems. Clinical documents are accompanied by metadata in a patient's medical history and frequently contain multiword concepts that can be valuable for accurately interpreting the included text. While existing methods have attempted to address these problems individually, we present a unified model for free-text clinical documents that integrates contextual patient- and document-level data and discovers multiword concepts. In the proposed model, phrases are represented by chained n-grams and a Dirichlet hyper-parameter is weighted by both document-level and patient-level context. This method and three other latent Dirichlet allocation models were fit to a large collection of clinical reports. Examples of resulting topics demonstrate the results of the new model and the quality of the representations are ev...

Topic Modeling for Classification of Clinical Reports

Electronic health records (EHRs) contain important clinical information about patients. Efficient and effective use of this information could supplement or even replace manual chart review as a means of studying and improving the quality and safety of healthcare delivery. However, some of these clinical data are in the form of free text and require pre-processing before use in automated systems. A common free-text data source is radiology reports, typically dictated by radiologists to explain their interpretations. We sought to demonstrate machine learning classification of computed tomography (CT) imaging reports into binary outcomes, i.e. positive and negative for fracture, using regular text classification and classifiers based on topic modeling. Topic modeling provides interpretable themes (topic distributions) in reports, a representation that is more compact than the commonly used bag-of-words representation and can be processed faster than raw text in subsequent automated processes. We demonstrate new classifiers based on this topic modeling representation of the reports. The aggregate topic classifier (ATC) and the confidence-based topic classifier (CTC) use a single topic, determined from the training dataset based on different measures, to classify the reports in the test dataset. Alternatively, the similarity-based topic classifier (STC) measures the similarity between the reports' topic distributions to determine the predicted class. Our proposed topic modeling-based classifier systems are shown to be competitive with existing text classification techniques and provide an efficient and interpretable representation.
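To make the similarity-based idea concrete, here is a minimal sketch of classifying a report by comparing topic distributions. The abstract does not say which similarity measure or decision rule STC uses, so cosine similarity and a single nearest neighbour are assumptions, as are the function names and the toy 3-topic vectors.

```python
import math

def cosine(p, q):
    """Cosine similarity between two equal-length topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def predict_by_similarity(train, query):
    """Return the label of the training report most similar to `query`.

    `train` is a list of (topic_distribution, label) pairs.
    Assumption: 1-nearest-neighbour under cosine similarity.
    """
    _, best_label = max(train, key=lambda pair: cosine(pair[0], query))
    return best_label

# Hypothetical topic distributions for four labelled reports.
train = [
    ([0.80, 0.15, 0.05], "positive"),
    ([0.70, 0.10, 0.20], "positive"),
    ([0.10, 0.60, 0.30], "negative"),
    ([0.05, 0.25, 0.70], "negative"),
]
label = predict_by_similarity(train, [0.75, 0.20, 0.05])
print(label)
```

Because each report is reduced to a handful of topic proportions, the comparison runs over a few numbers per report rather than a full vocabulary-sized bag-of-words vector.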

Identifying patterns of associated-conditions through topic models of Electronic Medical Records

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2016

Multiple adverse health conditions co-occurring in a patient are typically associated with poor prognosis and increased office or hospital visits. Developing methods to identify patterns of co-occurring conditions can assist in diagnosis. Thus, identifying patterns of association among co-occurring conditions is of growing interest. In this paper, we report preliminary results from a data-driven study, in which we apply a machine learning method, namely, topic modeling, to Electronic Medical Records (EMRs), aiming to identify patterns of associated conditions. Specifically, we use the well-established Latent Dirichlet Allocation (LDA), a method based on the idea that documents can be modeled as a mixture of latent topics, where each topic is a distribution over words. In our study, we adapt the LDA model to identify latent topics in patients' EMRs. We evaluate the performance of our method both qualitatively and quantitatively, and show that the obtained topics indeed align well with distinct medical phenomena characterized by co-occurring conditions.
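The mixture view of LDA described here can be illustrated with a small sketch of the generative side only: a record mixes topics, and each topic is a distribution over words (here, condition names). The topic names, vocabulary, and probabilities are hypothetical, and no inference is performed; fitting topics to real EMRs requires an LDA implementation.

```python
import random

# Hypothetical topics: each is a probability distribution over condition names.
topics = {
    "cardiometabolic": {"hypertension": 0.5, "diabetes": 0.4, "asthma": 0.1},
    "respiratory": {"asthma": 0.6, "copd": 0.3, "hypertension": 0.1},
}

def word_probability(word, doc_topic_mix):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(w * topics[t].get(word, 0.0) for t, w in doc_topic_mix.items())

def generate_record(doc_topic_mix, length, rng):
    """Sample a record: draw a topic per token, then a word from that topic."""
    names = list(doc_topic_mix)
    record = []
    for _ in range(length):
        topic = rng.choices(names, weights=[doc_topic_mix[n] for n in names])[0]
        words = list(topics[topic])
        record.append(rng.choices(words, weights=[topics[topic][w] for w in words])[0])
    return record

mix = {"cardiometabolic": 0.7, "respiratory": 0.3}
p_asthma = word_probability("asthma", mix)  # 0.7 * 0.1 + 0.3 * 0.6
record = generate_record(mix, 5, random.Random(0))
```

Conditions that receive high probability under the same topic are the ones the model treats as co-occurring, which is why fitted topics can be read as candidate patterns of associated conditions.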

Topic Modeling Based Classification of Clinical Reports

2013

We selected 25 papers for publishing out of the 52 submissions (an acceptance rate of 48%) that we received from students from a wide variety of countries. Three papers were presented orally in one of the parallel sessions of the main conference. The other 22 papers were shown as posters as part of the poster session of the main conference.

Discovering associations among diagnosis groups using topic modeling

AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science, 2014

With the rapid growth of electronic medical records (EMR), there is an increasing need to automatically extract patterns or rules from EMR data with machine learning and data mining techniques. In this work, we applied an unsupervised statistical model, latent Dirichlet allocation (LDA), to cluster patient diagnosis groups from the Rochester Epidemiology Project (REP). The initial results show that LDA holds the potential for broad application in epidemiology as well as other biomedical studies due to its unsupervised nature and great interpretive power.

Incorporating Statistical Topic Models in the Retrieval of Healthcare Documents

We present a framework based on Statistical Topic Models, Language Models, Information Extraction, and Ontology Analysis to retrieve healthcare-related documents for the CLEF eHealth 2013 Task 3. In this framework we add global information based on latent topics from the documents to improve the document retrieval. We perform six different experiments which consist of a baseline and six variants of the model. Preliminary results show that the use of Language Models with a bag-of-words scheme results in better estimates. However, model tuning in the topic-based model is required to achieve optimal results.

Mining heterogeneous clinical notes by multi-modal latent topic model

PLOS ONE

Latent knowledge can be extracted from the electronic notes that are recorded during patient encounters with the health system. Using these clinical notes to decipher a patient’s underlying comorbidities, symptom burdens, and treatment courses is an ongoing challenge. Latent topic models, as an efficient Bayesian method, can be used to model each patient’s clinical notes as “documents” and the words in the notes as “tokens”. However, standard latent topic models assume that all of the notes follow the same topic distribution, regardless of the type of note or the domain expertise of the author (such as doctors or nurses). We propose a novel application of latent topic modeling, using a multi-note topic model (MNTM) to jointly infer distinct topic distributions of notes of different types. We applied our model to clinical notes from the MIMIC-III dataset to infer distinct topic distributions over the physician and nursing note types. Based on manual assessments made by clinicians, we obser...

Redundancy-Aware Topic Modeling for Patient Record Notes

PLoS ONE, 2014

The clinical notes in a given patient record contain much redundancy, in large part due to clinicians' documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining and topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, nonredundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation carried out through log-likelihood on held-out data and topic coherence of produced topics and qualitative assessment of topics carried out by physicians show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and Red-LDA is made publicly available to the community. Citation: Cohen R, Aviram I, Elhadad M, Elhadad N (2014) Redundancy-Aware Topic Modeling for Patient Record Notes. PLoS ONE 9(2): e87555.
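Baseline (iii), removing redundant paragraphs within a record, might look roughly like the sketch below. The abstract does not describe how redundancy is detected, so exact matching after case and whitespace normalisation is an assumption, as are the function name and the blank-line paragraph convention; this is not the authors' released code.

```python
def drop_repeated_paragraphs(notes):
    """Keep only the first occurrence of each paragraph across a record's notes.

    `notes` is a list of note strings in chronological order. Paragraphs are
    assumed to be separated by blank lines, and two paragraphs count as
    redundant only if they match exactly after normalisation (an assumption).
    """
    seen = set()
    cleaned = []
    for note in notes:
        kept = []
        for para in note.split("\n\n"):
            key = " ".join(para.lower().split())  # normalise case and whitespace
            if key and key not in seen:
                seen.add(key)
                kept.append(para.strip())
        cleaned.append("\n\n".join(kept))
    return cleaned

# Hypothetical record: the second note copies a paragraph from the first.
record = [
    "Patient reports chest pain.\n\nHistory of hypertension.",
    "History of  hypertension.\n\nPain resolved after treatment.",
]
cleaned = drop_repeated_paragraphs(record)
```

The later note keeps only its new paragraph, so the input to LDA consists of partial, nonredundant documents as the baseline describes.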

Topic Modeling Technique for Text Mining Over Biomedical Text Corpora through Hybrid Inverse Documents Frequency and Fuzzy K-Means Clustering

IEEE Access

Text data plays an imperative role in the biomedical domain, as patient data comprises a huge amount of text documents in a non-standardized format. These text documents pose many challenges for data processing when the goal is to obtain the relevant data. Topic modeling is one of the popular techniques for theme-based information retrieval from biomedical documents, but discovering precise topics in such documents is a challenging task. Furthermore, redundancy in biomedical text documents has a negative impact on the quality of text mining. Therefore, the rapid growth of unstructured documents calls for machine learning techniques for topic modeling that are capable of discovering precise topics. In this paper, we propose a topic modeling technique for text mining based on a hybrid inverse document frequency and the fuzzy k-means clustering algorithm. The proposed technique ameliorates the redundancy issue and discovers precise topics from the biomedical text documents. The proposed technique generates local and global term frequencies through the bag-of-words (BOW) model. The global term weighting is calculated through the proposed hybrid inverse document frequency, and the local term weighting is computed with term frequency. Robust principal component analysis is used to remove the negative impact of high dimensionality on the global term weights. Afterward, classification and clustering for text mining are performed with the probability of topics in the documents. The classification is performed through a discriminant analysis classifier, whereas the clustering is done through k-means clustering. The performance of clustering is evaluated with the Calinski-Harabasz (CH) index internal validation method. The proposed topic modeling technique is evaluated on six standard datasets, namely Ohsumed, MuchMore Springer Corpus, GENIA corpus, Bioxtext, tweets and WSJ redundant corpus. The proposed topic modeling technique exhibits high performance on classification and clustering in text mining compared to baseline topic models like FLSA, LDA, and LSA. Moreover, the execution time of the proposed topic modeling technique remains stable for different numbers of topics.
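The split between local and global term weighting can be sketched as follows. The abstract does not define the hybrid inverse document frequency, so this example substitutes the textbook IDF, log(N / df), purely as a stand-in; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Local weight: raw term frequency within the document.
    Global weight: log(N / df), the textbook IDF, used here only as a
    stand-in for the paper's hybrid variant (which the abstract omits).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()})
    return weights

# Hypothetical bag-of-words documents.
docs = [
    ["fracture", "wrist", "fracture"],
    ["wrist", "pain"],
    ["chest", "pain", "wrist"],
]
weights = tf_idf(docs)
```

A term that appears in every document ("wrist" here) gets a global weight of zero, which is one simple way a global weighting scheme can discount repeated, uninformative content.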

Automated topic analysis for restricted scope health corpora: methodology and comparison with human performance

Proceedings of the Annual Hawaii International Conference on System Sciences, 2021

This paper addresses the problem of identifying topics that describe the information content of restricted-size sets of scientific papers extracted from publication databases. Conventional computational approaches, based on natural language processing using unsupervised classification algorithms, typically require large numbers of papers to achieve adequate training. The approach presented here uses a simpler word-frequency-based method coupled with context modeling. An example is provided of its application to corpora resulting from a curated literature search site for COVID-19 research publications. The results are compared with a conventional human-based approach, indicating partial overlap in the topics identified. The findings suggest that computational approaches may provide an alternative to human expert topic analysis, provided adequate contextual models are available.