Vandana Mukherjee - Academia.edu

Papers by Vandana Mukherjee

Commensal or pathogen: computationally vectorizing microbial genomes for de novo risk assessment and virulence feature discovery in Klebsiella pneumoniae

bioRxiv (Cold Spring Harbor Laboratory), May 13, 2024

Bacterial pathogenicity has traditionally focused on gene-level content with experimentally-confirmed functional properties. Hence, significant inferences are made based on similarity to known pathotypes and DNA-based genomic subtyping for risk. Herein, we achieved de novo prediction of human virulence in Klebsiella pneumoniae by expanding known virulence genes with spatially proximal gene discoveries linked by functional domain architectures across all prokaryotes. This approach identified gene ontology functions not typically associated with virulence sensu stricto. By leveraging machine learning models with these expanded discoveries, public genomes were assessed for virulence prediction using categorizations derived from isolation sources captured in available metadata. De novo strain-level virulence prediction achieved an F1-score of 0.81. Virulence predictions using expanded "discovered" functional genetic content were superior to those restricted to extant virulence database content. Additionally, this approach highlighted the incongruence in relying on traditional phylogenetic subtyping for categorical inferences. Our approach represents an improved deconstruction of genome-scale datasets for functional predictions and risk assessment intended to advance public health surveillance of emerging pathogens.

Agent Simulation Using Path Telemetry for Modeling COVID-19 Workplace Hazard and Risk

Clockwork: A Discrete Event and Agent-Based Social Simulation Framework

Research Square, Dec 13, 2023

Agent-based social simulation can be useful for creating digital twins of societies of interacting autonomous agents. Such simulations are useful for testing hypotheses about behaviour and belief change in the presence of interventions. In this paper, we introduce Clockwork, an efficient multiagent simulation framework. The key contribution of this framework is integrating a behaviour-level simulation of autonomous agent schedules with a Discrete Event Simulator (DES) and a rich model of social interaction among agents modelled with Agent-Based Social Simulation (ABSS) methodology. Clockwork's ability to simulate deterministic and stochastic events, together with changes in individual agent models due to the influence of other agents through event-based emergent interactions, is novel. We describe the design, architecture, and development of the Clockwork framework. Combining DES and ABSS in this way enables the modelling of detailed individual agent histories while maintaining population-level statistical distributions. We evaluated Clockwork with a real-life scenario modelling the digital twin of employees at a real-world worksite to test the efficacy of various policies put in place to mitigate the risk of contracting COVID-19 through a hybrid or in-person work model.
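The discrete-event half of this idea can be illustrated with a small priority-queue loop. This is only a minimal sketch, not Clockwork's actual design: the event names ("arrive", "meeting"), the 9.0 arrival time, and the delay range are assumptions chosen for illustration, and the social-interaction (ABSS) model is not represented.

```python
import heapq
import random


def run_simulation(agents, horizon, seed=0):
    """Process deterministic and stochastic events in time order.

    All event names and timings here are hypothetical placeholders.
    """
    rng = random.Random(seed)
    queue = []  # min-heap of (time, sequence, agent, event_name)
    seq = 0     # tie-breaker so equal-time events pop in insertion order
    for agent in agents:
        # Deterministic event: every agent arrives at t = 9.0.
        heapq.heappush(queue, (9.0, seq, agent, "arrive"))
        seq += 1

    log = []
    while queue:
        time, _, agent, event = heapq.heappop(queue)
        if time > horizon:
            break
        log.append((time, agent, event))
        if event == "arrive":
            # Stochastic event: a meeting follows after a random delay.
            delay = rng.uniform(0.5, 3.0)
            heapq.heappush(queue, (time + delay, seq, agent, "meeting"))
            seq += 1
    return log
```

Calling `run_simulation(["a", "b"], horizon=17.0)` yields a per-agent event history in time order; a fixed seed keeps the stochastic part reproducible.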

Improving the Path from Diagnoses to Documentation: A Cognitive Review Tool for Clinical Notes and Administrative Records

PubMed, 2018

EMR systems are intended to improve patient-centered care management and hospital administrative processing. However, the information stored in EMRs can be disorganized, incomplete, or inconsistent, creating problems at the patient and system level. We present a technology that reconciles inconsistencies between clinical diagnoses and administrative records by analyzing free-text notes, problem lists and recorded diagnoses in real time. A fully integrated pipeline has been developed for efficient, knowledge-driven extraction, normalization, and matching of disease terms among structured and unstructured data, with modular precision of 94-98% on over 1000 patients. This cognitive data review tool improves the path from diagnosis to documentation, facilitating accurate and timely clinical and administrative decision-making.

A knowledge-based question answering system to provide cognitive assistance to radiologists

With the advent of computers and natural language processing, it is not surprising to see that humans are trying to use computers to answer questions. By the 1960s, there were systems implemented on the two major models of question answering, IR-based and knowledge-based, to answer questions about sport statistics and scientific facts. This paper reports on the development of a knowledge-based question answering system that is aimed at providing cognitive assistance to radiologists. Our system represents the question as a semantic query to a medical knowledge base. Evidence obtained from textual and imaging data associated with the question is then combined to arrive at an answer. This question answering system has 3 stages: i) question text and answer choices processing, ii) image processing, and iii) reasoning. Currently, the system can answer differential diagnosis and patient management questions; however, we can tackle a wider variety of question types by improving our medical knowledge coverage in the future.

Semantic Expansion of Clinician Generated Data Preferences for Automatic Patient Data Summarization

PubMed, 2021

Patient Electronic Health Records (EHRs) typically contain a substantial amount of data, which can lead to information overload for clinicians, especially in high-throughput fields like radiology. Thus, it would be beneficial to have a mechanism for summarizing the most clinically relevant patient information pertinent to the needs of clinicians. This study presents a novel approach for the curation of clinician EHR data preference information towards the ultimate goal of providing robust EHR summarization. Clinicians first provide a list of data items of interest across multiple EHR categories. Since this data is manually dictated, it has limited coverage and may not cover all the important terms relevant to a concept. To address this problem, we have developed a knowledge-driven semantic concept expansion approach by leveraging rich biomedical knowledge from the UMLS. The approach expands 1094 seed concepts to 22,325 concepts with 92.69% of the expanded concepts identified as relevant by clinicians.

Receptivity of an AI Cognitive Assistant by the Radiology Community: A Report on Data Collected at RSNA

Due to advances in machine learning and artificial intelligence (AI), a new role is emerging for machines as intelligent assistants to radiologists in their clinical workflows. But what systematic clinical thought processes are these machines using? Are they similar enough to those of radiologists to be trusted as assistants? A live demonstration of such a technology was conducted at the 2016 Scientific Assembly and Annual Meeting of the Radiological Society of North America (RSNA). The demonstration was presented in the form of a question-answering system that took a radiology multiple choice question and a medical image as inputs. The AI system then demonstrated a cognitive workflow, involving text analysis, image analysis, and reasoning, to process the question and generate the most probable answer. A post-demonstration survey was made available to the participants who experienced the demo and tested the question answering system. Of the reported 54,037 meeting registrants, 2,927 visited the demonstration booth, 1,991 experienced the demo, and 1,025 completed a post-demonstration survey. In this paper, the methodology of the survey is shown and a summary of its results is presented. The results of the survey show a very high level of receptiveness to cognitive computing technology and artificial intelligence among radiologists.
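The participation counts reported above can be turned into step-by-step conversion rates. The short sketch below only re-derives percentages from the four numbers given in the abstract; the variable and function names are illustrative.

```python
# Counts as reported in the abstract.
registrants = 54_037
visited = 2_927
experienced = 1_991
completed = 1_025


def rate(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


funnel = {
    "visited booth / registrants": rate(visited, registrants),
    "experienced demo / visited booth": rate(experienced, visited),
    "completed survey / experienced demo": rate(completed, experienced),
    "completed survey / registrants": rate(completed, registrants),
}

for step, percent in funnel.items():
    print(f"{step}: {percent}%")
```

Read this way, the survey respondents are roughly half of those who tried the demo and a small fraction of all registrants, which is worth keeping in mind when interpreting the reported receptiveness.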

Convolutional autoencoder based model HistoCAE for segmentation of viable tumor regions in liver whole-slide images

Scientific Reports, Jan 8, 2021

Liver cancer is one of the leading causes of cancer deaths in Asia and Africa. It is caused by hepatocellular carcinoma (HCC) in almost 90% of all cases. HCC is a malignant tumor and the most common histological type of the primary liver cancers. The detection and evaluation of viable tumor regions in HCC is of important clinical significance, since it is a key step in assessing the response to chemoradiotherapy and the tumor cell proportion in genetic tests. Recent advances in computer vision, digital pathology and microscopy imaging enable automatic histopathology image analysis for cancer diagnosis. In this paper, we present a multi-resolution deep learning model, HistoCAE, for viable tumor segmentation in whole-slide liver histopathology images. We propose a convolutional autoencoder (CAE) based framework with a customized reconstruction loss function for image reconstruction, followed by a classification module to classify each image patch as tumor versus non-tumor. The resulting patch-based prediction results are spatially combined to generate the final segmentation result for each WSI. Additionally, the spatially organized encoded feature map derived from small image patches is used to compress the gigapixel whole-slide images. Our proposed model presents superior performance to other benchmark models with extensive experiments, suggesting its efficacy for viable tumor area segmentation with liver whole-slide images. Liver is a visceral organ frequently targeted by cancer metastasis. Hepatocellular carcinoma (HCC) is the most common histological type of primary liver cancers with hepatocellular differentiation. Tumors are known to have multiple cellular and stromal components such as tumor cells, inflammatory cells, blood vessels, acellular matrix, tumor capsule, fluid, mucin, or necrosis.
The viable tumor regions are more active and responsive regions inside the tissue area. In clinical practice, tissue samples are used to assess chemoradiotherapy response rates and tumor cell proportions in genetic tests. Therefore, there is a strong but unmet need to accurately evaluate viable tumor. Pathologists often use a semi-quantitative grading system for the residual tumor burden estimation. Thanks to recent advancement in digital pathology, whole slide images (WSI) of tumor tissues can now be quantitatively and automatically analyzed with microscopy image analysis algorithms that complement traditional manual tissue examinations [1]. MICCAI grand challenge 2019 has prepared a well-annotated dataset to address this specific problem [2]. With the recent emergence of deep learning methods for medical image analysis, tumor segmentation in liver WSIs can be well addressed by deep learning models that conduct patch-wise classifications or pixel-wise semantic segmentation. The most popular semantic segmentation framework is FCN [3], consisting of down-sampling layers to extract image features and up-sampling layers to generate the segmentation mask. UNet [4] is another widely used segmentation model, introducing skip connections from down-sampling layers to up-sampling layers to preserve the information for high-resolution images. By contrast, the patch-based methods partition the large WSIs into small image patches and classify each patch as either tumor or non-tumor [5,6]. A CNN is used [7] to extract features from each patch and assign a prediction score. Based on the prediction score map in WSIs, breast metastasis cancer is segmented. As contextual information from one fixed resolution may not be enough to detect the tumor regions accurately, multiple methods have addressed this shortcoming by incorporating multi-scale contextual information into the patch-wise classification model [8].
Convolutional auto-encoder (CAE) and Convolutional neural network (CNN) have been integrated for finger vein verification, where CAE
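The patch-based strategy described in this entry (partition the image into patches, classify each patch, then spatially combine the labels) can be sketched in a few lines. This is an illustration under stated assumptions, not the HistoCAE model: the image is a plain 2-D list of intensities, and `mean_intensity_classifier` is a hypothetical stand-in for a trained patch classifier.

```python
def segment_by_patches(image, patch_size, classify_patch):
    """Tile a 2-D image, classify each tile, and paint the label onto a mask."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Extract one patch; edge patches may be smaller than patch_size.
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            label = 1 if classify_patch(patch) else 0
            # Spatially combine: every pixel of the patch gets the patch label.
            for r in range(top, min(top + patch_size, height)):
                for c in range(left, min(left + patch_size, width)):
                    mask[r][c] = label
    return mask


def mean_intensity_classifier(patch, threshold=0.5):
    """Hypothetical placeholder classifier: 'tumor' if mean intensity is high."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values) > threshold
```

For a 4x4 toy image whose top-left 2x2 block is bright, `segment_by_patches(image, 2, mean_intensity_classifier)` marks exactly that block in the output mask. The mask resolution is limited by the patch size, which is one reason multi-scale context is attractive.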

Application of Federated Learning in Medical Imaging

Semantic Expansion of Clinician Generated Data Preferences for Automatic Patient Data Summarization

AMIA ... Annual Symposium proceedings. AMIA Symposium, 2021

Patient Electronic Health Records (EHRs) typically contain a substantial amount of data, which can lead to information overload for clinicians, especially in high-throughput fields like radiology. Thus, it would be beneficial to have a mechanism for summarizing the most clinically relevant patient information pertinent to the needs of clinicians. This study presents a novel approach for the curation of clinician EHR data preference information towards the ultimate goal of providing robust EHR summarization. Clinicians first provide a list of data items of interest across multiple EHR categories. Since this data is manually dictated, it has limited coverage and may not cover all the important terms relevant to a concept. To address this problem, we have developed a knowledge-driven semantic concept expansion approach by leveraging rich biomedical knowledge from the UMLS. The approach expands 1094 seed concepts to 22,325 concepts with 92.69% of the expanded concepts identified as rele...

Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes

Viruses, 2021

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods and results in missing or incorrect sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on the use of a single reference genome and by overcoming atypical genomic traits that challenge traditional bioinformatic methods. We analyzed an initial corpus of 66,000 SARS-CoV-2 genome sequences collected from labs across the world using our method and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction, compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site)....

IBM Functional Genomics Platform - ThinkLab Horizons Science Slam

OMXWare, A Cloud-Based Platform for Studying Microbial Life at Scale

The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. Due to increasing availability of genomic data, traditional bioinformatic tools require substantial computational time and creation of ever larger indices each time a researcher seeks to gain insight from the data. To address these challenges, we pre-compute important relationships between biological entities and capture this information in a relational database. The database can be queried across millions of entities and returns results in a fraction of the time required by traditional methods. In this paper, we describe OMXWare, a comprehensive database relating genotype to phenotype for bacterial life. Continually updated, OMXWare today contains data derived from 200,000 curated, self-consistently assembled genomes. The database stores functional data for over 68 million genes, 52 million proteins, and 239 million domains wi...

Analysis and Forecasting of Global RT-PCR Primers for SARS-CoV-2

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on sequence homology of nucleic acid sequence and is much faster than tissue culture or animal cell models. However, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. Different primer sequences have been adopted in different geographic regions. As we rely on these existing RT-PCR primers to track and manage the spread of the Coronavirus, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use tod...

Generalized Extraction and Classification of Span-Level Clinical Phrases

AMIA ... Annual Symposium proceedings. AMIA Symposium, 2018

Much of the critical information in a patient's electronic health record (EHR) is hidden in unstructured text. As such, there is an increasing role for automated text extraction and summarization to make this information available in a way that can be quickly and easily understood. While many clinical note text extraction techniques have been examined, most existing techniques are either narrowly targeted or focus primarily on concept-level extraction, potentially missing important contextual information. In contrast, in this work we examine the extraction of several clinical categories at the phrase level, attempting to provide the necessary context while still keeping the extracted elements concise. To do so, we employ a three-stage pipeline which extracts categorized phrases of interest using clinical concepts as anchor points. Results suggest the proposed method achieves performance comparable to that of individual human annotators.

Visual Dialog for Radiology: Data Curation and First Steps

Recent work in clinical AI has been focusing on solving tasks that involve both image understanding and reading comprehension. In this study, we further pursue this line of research and introduce the first Visual Dialog task in Radiology, which adds complexity to existing tasks. We present our data collection strategy for both silver and gold-standard datasets for chest x-ray images and discuss associated challenges. We evaluate a Stacked Attention Network model, commonly used for Visual Question Answering in the medical domain, and provide baseline results indicating the difficulty of the task.

Semi-supervised identification of SARS-CoV-2 molecular targets

bioRxiv, 2021

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. In this work, we analyzed a corpus of 66,000 SARS-CoV-2 genome sequences. We developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on use of a single reference genome and by overcoming atypical genome traits. Using this method, we identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction compared to proteome references including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools such as Prokka (base) and VAPiD, we obtained a 6.4- and 1.8-fold increase in protein annotations. Our method generated 13,000,000 molecular target sequences, some conserved across time and geography while others represent emerging variants. We observed 3,3...

Analysis and forecasting of global real time RT-PCR primers and probes for SARS-CoV-2

Scientific Reports, 2021

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on sequence homology of nucleic acid sequence and is much faster than tissue culture or animal cell models. However, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. Different primer sequences have been adopted in different geographic regions. As we rely on these existing RT-PCR primers to track and manage the spread of the Coronavirus, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use tod...
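Divergence between a primer and its target site can be illustrated by counting mismatched bases. The sketch below is a deliberately simplified assumption, not the analysis used in the study: it compares a primer against same-strand windows of a sequence by Hamming distance, ignores reverse-complement binding, indels, and probes, and is meant to be tried on made-up toy strings rather than real primers.

```python
def count_mismatches(primer, target_site):
    """Number of positions at which primer and target site differ."""
    if len(primer) != len(target_site):
        raise ValueError("primer and target site must have equal length")
    return sum(1 for p, t in zip(primer.upper(), target_site.upper()) if p != t)


def best_binding_site(primer, genome):
    """Slide the primer along the genome; return (position, mismatches)
    for the closest-matching window, or None if the genome is too short."""
    best = None
    for start in range(len(genome) - len(primer) + 1):
        window = genome[start:start + len(primer)]
        mismatches = count_mismatches(primer, window)
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best
```

Running `best_binding_site` for the same primer against sequences sampled at different times or places gives a simple mismatch count per sample, which is one way to picture targets drifting away from a fixed primer.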

Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021

The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. With the increasing availability of genomic data, traditional bioinformatic tools require substantial computational time and the creation of ever-larger indices each time a researcher seeks to gain insight from the data. To address these challenges, we pre-computed important relationships between biological entities spanning the Central Dogma of Molecular Biology and captured this information in a relational database. The database can be queried across hundreds of millions of entities and returns results in a fraction of the time required by traditional methods. In this paper, we describe IBM Functional Genomics Platform (formerly known as OMXWare), a comprehensive database relating genotype to phenotype for bacterial life. Continually updated, IBM Functional Genomics Platform today contains data derived from 200,000 curated, self-consistently assembled genomes. The database stores functional data for over 68 million genes, 52 million proteins, and 239 million domains with associated biological activity annotations from Gene Ontology, KEGG, MetaCyc, and Reactome. IBM Functional Genomics Platform maps all of the many-to-many connections between each biological entity including the originating genome, gene, protein, and protein domain. Various microbial studies, from infectious disease to environmental health, can benefit from the rich data and connections. We describe the data selection, the pipeline to create and update the IBM Functional Genomics Platform, and the developer tools (Python SDK and REST APIs) which allow researchers to efficiently study microbial life at scale.
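The idea of pre-computing many-to-many links between genomes, genes, and proteins in a relational database can be sketched with an in-memory SQLite database. Everything below is hypothetical: the table names, column names, and toy rows are assumptions for illustration and do not reflect the platform's actual schema, SDK, or REST API.

```python
import sqlite3

# Hypothetical schema: entity tables plus link tables for many-to-many relations.
SCHEMA = """
CREATE TABLE genome  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gene    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE genome_gene  (genome_id INTEGER REFERENCES genome(id),
                           gene_id   INTEGER REFERENCES gene(id));
CREATE TABLE gene_protein (gene_id    INTEGER REFERENCES gene(id),
                           protein_id INTEGER REFERENCES protein(id));
"""


def build_demo_db():
    """Create an in-memory database populated with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO genome VALUES (?, ?)",
                     [(1, "genome_A"), (2, "genome_B")])
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [(1, "gene_x"), (2, "gene_y")])
    conn.executemany("INSERT INTO protein VALUES (?, ?)",
                     [(1, "protein_p"), (2, "protein_q")])
    # gene_x occurs in both genomes; gene_y only in genome_A.
    conn.executemany("INSERT INTO genome_gene VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 1)])
    conn.executemany("INSERT INTO gene_protein VALUES (?, ?)",
                     [(1, 1), (2, 2)])
    conn.commit()
    return conn


def proteins_for_genome(conn, genome_name):
    """Follow genome -> gene -> protein links and return sorted protein names."""
    rows = conn.execute(
        """
        SELECT DISTINCT protein.name
        FROM genome
        JOIN genome_gene  ON genome_gene.genome_id = genome.id
        JOIN gene_protein ON gene_protein.gene_id = genome_gene.gene_id
        JOIN protein      ON protein.id = gene_protein.protein_id
        WHERE genome.name = ?
        ORDER BY protein.name
        """,
        (genome_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

Because the links are stored once in the join tables, a question such as "which proteins does this genome encode" becomes a single query rather than a fresh sequence search.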

On the Role of Artificial Intelligence in Medical Imaging of COVID-19

During the COVID-19 pandemic, lung imaging takes a key role in addressing the magnified need for speed, cost, ubiquity and precision in medical care. The rise of artificial intelligence induced a quantum leap in medical imaging: AI has now proven equipollent to healthcare professionals in several diseases and has the potential to save time and cost and to increase coverage. But AI-accelerated medical imaging must still fully demonstrate its ability in remediating diseases such as COVID-19. We identify key use cases of lung imaging for COVID-19, comparing CT, X-Ray and ultrasound imaging from clinical and AI perspectives. We perform a systematic, manual survey of 197 related publications that reveals a disparity in the focus of the AI and clinical communities, caused by data availability and the lack of collaboration, and in modality trends, driven by ubiquity. Last, challenges in AI-acceleration and ways to remediate them are discussed and future research goals are identified.

Research paper thumbnail of Commensal or pathogen: computational vectorizing microbial genomes for<i>de novo</i>risk assessment and virulence feature discovery in<i>Klebsiella pneumoniae</i>

bioRxiv (Cold Spring Harbor Laboratory), May 13, 2024

Bacterial pathogenicity has traditionally focused on gene-level content with experimentally-confi... more Bacterial pathogenicity has traditionally focused on gene-level content with experimentally-confirmed functional properties. Hence, significant inferences are made based on similarity to known pathotypes and DNAbased genomic subtyping for risk. Herein, we achieved de novo prediction of human virulence in Klebsiella pneumoniae by expanding known virulence genes with spatially proximal gene discoveries linked by functional domain architectures across all prokaryotes. This approach identified gene ontology functions not typically associated with virulence sensu stricto. By leveraging machine learning models with these expanded discoveries, public genomes were assessed for virulence prediction using categorizations derived from isolation sources captured in available 1 Springer Nature 2021 L A T E X template 2 Commensal or pathogen: computationally vectorizing microbial genomes for de novo r metadata. Performance for de novo strain-level virulence prediction achieved 0.81 F1-Score. Virulence predictions using expanded "discovered" functional genetic content were superior to that restricted to extant virulence database content. Additionally, this approach highlighted the incongruence in relying on traditional phylogenetic subtyping for categorical inferences. Our approach represents an improved deconstruction of genome-scale datasets for functional predictions and risk assessment intended to advance public health surveillance of emerging pathogens.

Research paper thumbnail of Agent Simulation Using Path Telemetry for Modeling COVID-19 Workplace Hazard and Risk

Research paper thumbnail of Clockwork: A Discrete Event and Agent-Based Social Simulation Framework

Research Square (Research Square), Dec 13, 2023

Agent-based social simulation can be useful for creating digital twins of societies of interactin... more Agent-based social simulation can be useful for creating digital twins of societies of interacting autonomous agents. Such simulations are useful for testing hypotheses about behaviour and belief change in the presence of interventions. In this paper, we introduce Clockwork, an efficient multiagent simulation framework. The key contribution of this framework is integrating a behaviour-level simulation of autonomous agent schedules with a Discrete Event Simulator (DES) and a rich model of social interaction among agents modelled with Agent-Based Social Simulation (ABSS) methodology. Clockwork's ability to simulate deterministic and stochastic events and changes in individual agent models due to the influence of other agents through event-based emergent interactions is novel. We describe the design, architecture, and development of the Clockwork framework. Combining DES and ABSS in this way enables the modelling of detailed individual agent histories while maintaining population-level statistical distributions. We evaluated Clockwork with a real-life scenario modelling the digital twin of employees at a real-world worksite 1 Springer Nature 2021 L A T E X template 2 Clockwork: A Discrete Event and Agent-Based Social Simulation Framework to test the efficacy of various policies put in place to mitigate the risk of contracting COVID-19 through a hybrid or in-person work model.

Research paper thumbnail of Improving the Path from Diagnoses to Documentation: A Cognitive Review Tool for Clinical Notes and Administrative Records

PubMed, 2018

EMR systems are intended to improve patient-centered care management and hospital administrative ... more EMR systems are intended to improve patient-centered care management and hospital administrative processing. However, the information stored in EMRs can be disorganized, incomplete, or inconsistent, creating problems at the patient and system level. We present a technology that reconciles inconsistencies between clinical diagnoses and administrative records by analyzing free-text notes, problem lists and recorded diagnoses in real time. A fully integrated pipeline has been developed for efficient, knowledge-driven extraction, normalization, and matching of disease terms among structured and unstructured data, with modular precision of 94-98% on over 1000 patients. This cognitive data review tool improves the path from diagnosis to documentation, facilitating accurate and timely clinical and administrative decision-making.

Research paper thumbnail of A knowledge-based question answering system to provide cognitive assistance to radiologists

With the advent of computers and natural language processing, it is not surprising to see that hu... more With the advent of computers and natural language processing, it is not surprising to see that humans are trying to use computers to answer questions. By the 1960s, there were systems implemented on the two major models of question answering, IR-based and knowledge-based, to answer questions about sport statistics and scientific facts. This paper reports on the development of a knowledge-based question answering system that is aimed at providing cognitive assistance to radiologists. Our system represents the question as a semantic query to a medical knowledge base. Evidence obtained from textual and imaging data associated with the question is then combined to arrive at an answer. This question answering system has 3 stages: i) question text and answer choices processing, ii) image processing, and iii) reasoning. Currently, the system can answer differential diagnosis and patient management questions, however, we can tackle a wider variety of question types by improving our medical knowledge coverage in the future.

Research paper thumbnail of Semantic Expansion of Clinician Generated Data Preferences for Automatic Patient Data Summarization

PubMed, 2021

Patient Electronic Health Records (EHRs) typically contain a substantial amount of data, which ca... more Patient Electronic Health Records (EHRs) typically contain a substantial amount of data, which can lead to information overload for clinicians, especially in high-throughput fields like radiology. Thus, it would be beneficial to have a mechanism for summarizing the most clinically relevant patient information pertinent to the needs of clinicians. This study presents a novel approach for the curation of clinician EHR data preference information towards the ultimate goal of providing robust EHR summarization. Clinicians first provide a list of data items of interest across multiple EHR categories. Since this data is manually dictated, it has limited coverage and may not cover all the important terms relevant to a concept. To address this problem, we have developed a knowledge-driven semantic concept expansion approach by leveraging rich biomedical knowledge from the UMLS. The approach expands 1094 seed concepts to 22,325 concepts with 92.69% of the expanded concepts identified as relevant by clinicians.

Receptivity of an AI Cognitive Assistant by the Radiology Community: A Report on Data Collected at RSNA

Due to advances in machine learning and artificial intelligence (AI), a new role is emerging for machines as intelligent assistants to radiologists in their clinical workflows. But what systematic clinical thought processes are these machines using? Are they similar enough to those of radiologists to be trusted as assistants? A live demonstration of such a technology was conducted at the 2016 Scientific Assembly and Annual Meeting of the Radiological Society of North America (RSNA). The demonstration was presented in the form of a question-answering system that took a radiology multiple choice question and a medical image as inputs. The AI system then demonstrated a cognitive workflow, involving text analysis, image analysis, and reasoning, to process the question and generate the most probable answer. A post-demonstration survey was made available to the participants who experienced the demo and tested the question answering system. Of the reported 54,037 meeting registrants, 2,927 visited the demonstration booth, 1,991 experienced the demo, and 1,025 completed a post-demonstration survey. In this paper, the methodology of the survey is shown and a summary of its results is presented. The results of the survey show a very high level of receptiveness to cognitive computing technology and artificial intelligence among radiologists.

Convolutional autoencoder based model HistoCAE for segmentation of viable tumor regions in liver whole-slide images

Scientific Reports, Jan 8, 2021

Liver cancer is one of the leading causes of cancer deaths in Asia and Africa. It is caused by hepatocellular carcinoma (HCC) in almost 90% of all cases. HCC is a malignant tumor and the most common histological type of primary liver cancer. The detection and evaluation of viable tumor regions in HCC is of important clinical significance, since it is a key step in assessing response to chemoradiotherapy and tumor cell proportion in genetic tests. Recent advances in computer vision, digital pathology and microscopy imaging enable automatic histopathology image analysis for cancer diagnosis. In this paper, we present a multi-resolution deep learning model, HistoCAE, for viable tumor segmentation in whole-slide liver histopathology images. We propose a convolutional autoencoder (CAE) based framework with a customized reconstruction loss function for image reconstruction, followed by a classification module to classify each image patch as tumor versus non-tumor. The resulting patch-based prediction results are spatially combined to generate the final segmentation result for each whole-slide image (WSI). Additionally, the spatially organized encoded feature map derived from small image patches is used to compress the gigapixel whole-slide images. Our proposed model presents superior performance to other benchmark models with extensive experiments, suggesting its efficacy for viable tumor area segmentation with liver whole-slide images.

Liver is a visceral organ frequently targeted by cancer metastasis. Hepatocellular carcinoma (HCC) is the most common histological type of primary liver cancers with hepatocellular differentiation. Tumors are known to have multiple cellular and stromal components such as tumor cells, inflammatory cells, blood vessels, acellular matrix, tumor capsule, fluid, mucin, or necrosis.
The viable tumor regions are more active and responsive regions inside the tissue area. In clinical practice, tissue samples are used to assess chemoradiotherapy response rates and tumor cell proportions in genetic tests. Therefore, there is a strong but unmet need to accurately evaluate viable tumor. Pathologists often use a semi-quantitative grading system for the residual tumor burden estimation. Thanks to recent advances in digital pathology, whole slide images (WSI) of tumor tissues can now be quantitatively and automatically analyzed with microscopy image analysis algorithms that complement traditional manual tissue examinations [1]. The MICCAI grand challenge 2019 has prepared a well-annotated dataset to address this specific problem [2]. With the recent emergence of deep learning methods for medical image analysis, tumor segmentation in liver WSIs can be well addressed by deep learning models that conduct patch-wise classification or pixel-wise semantic segmentation. The most popular semantic segmentation framework is FCN [3], consisting of down-sampling layers to extract image features and up-sampling layers to generate the segmentation mask. UNet [4] is another widely used segmentation model, introducing skip connections from down-sampling layers to up-sampling layers to preserve the information for high-resolution images. By contrast, the patch-based methods partition the large WSIs into small image patches and classify each patch as either tumor or non-tumor [5,6]. A CNN is used to extract features from each patch and assign a prediction score [7]. Based on the prediction score map in WSIs, breast cancer metastasis is segmented. As contextual information from one fixed resolution may not be enough to detect the tumor regions accurately, multiple methods have addressed this shortcoming by incorporating multi-scale contextual information into the patch-wise classification model [8].
Convolutional auto-encoder (CAE) and Convolutional neural network (CNN) have been integrated for finger vein verification, where CAE
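
The patch-based strategy described in this entry (partition the slide into patches, classify each patch as tumor or non-tumor, then spatially combine the predictions into a mask) can be sketched as follows. This is a minimal illustration under stated assumptions, not the HistoCAE implementation: `segment_by_patches` and `toy_classifier` are hypothetical names, the image is a tiny synthetic array, and a simple brightness threshold stands in for a trained classification module.

```python
import numpy as np


def segment_by_patches(image, patch_size, classify_patch):
    """Tile `image` into non-overlapping patches, classify each patch,
    and stitch the per-patch labels back into a full-resolution mask."""
    height, width = image.shape[:2]
    mask = np.zeros((height, width), dtype=bool)
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = image[top:top + patch_size, left:left + patch_size]
            # Every pixel of the patch inherits the patch-level label.
            mask[top:top + patch_size, left:left + patch_size] = classify_patch(patch)
    return mask


def toy_classifier(patch):
    # Hypothetical stand-in for a trained classifier:
    # dark patches are labeled "tumor" (True).
    return bool(patch.mean() < 0.5)


# Synthetic 8x8 "slide": bright background with one dark 4x4 region.
image = np.ones((8, 8))
image[0:4, 4:8] = 0.0

mask = segment_by_patches(image, patch_size=4, classify_patch=toy_classifier)
tumor_pixels = int(mask.sum())
```

In a real pipeline the classifier would be a trained network and the patches would come from a gigapixel slide read in tiles; the stitching logic is the only part this sketch is meant to show.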

Application of Federated Learning in Medical Imaging

Semantic Expansion of Clinician Generated Data Preferences for Automatic Patient Data Summarization

AMIA ... Annual Symposium proceedings. AMIA Symposium, 2021

Patient Electronic Health Records (EHRs) typically contain a substantial amount of data, which can lead to information overload for clinicians, especially in high-throughput fields like radiology. Thus, it would be beneficial to have a mechanism for summarizing the most clinically relevant patient information pertinent to the needs of clinicians. This study presents a novel approach for the curation of clinician EHR data preference information towards the ultimate goal of providing robust EHR summarization. Clinicians first provide a list of data items of interest across multiple EHR categories. Since this data is manually dictated, it has limited coverage and may not cover all the important terms relevant to a concept. To address this problem, we have developed a knowledge-driven semantic concept expansion approach by leveraging rich biomedical knowledge from the UMLS. The approach expands 1094 seed concepts to 22,325 concepts with 92.69% of the expanded concepts identified as rele...

Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes

Viruses, 2021

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods and results in missing or incorrect sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on the use of a single reference genome and by overcoming atypical genomic traits that challenge traditional bioinformatic methods. We analyzed an initial corpus of 66,000 SARS-CoV-2 genome sequences collected from labs across the world using our method and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction, compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site)....

IBM Functional Genomics Platform - ThinkLab Horizons Science Slam

OMXWare, A Cloud-Based Platform for Studying Microbial Life at Scale

The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. Due to increasing availability of genomic data, traditional bioinformatic tools require substantial computational time and creation of ever larger indices each time a researcher seeks to gain insight from the data. To address these challenges, we pre-compute important relationships between biological entities and capture this information in a relational database. The database can be queried across millions of entities and returns results in a fraction of the time required by traditional methods. In this paper, we describe OMXWare, a comprehensive database relating genotype to phenotype for bacterial life. Continually updated, OMXWare today contains data derived from 200,000 curated, self-consistently assembled genomes. The database stores functional data for over 68 million genes, 52 million proteins, and 239 million domains wi...

Analysis and Forecasting of Global RT-PCR Primers for SARS-CoV-2

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on nucleic acid sequence homology and is much faster than tissue culture or animal cell models. However, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. Different primer sequences have been adopted in different geographic regions. As we rely on these existing RT-PCR primers to track and manage the spread of the Coronavirus, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use tod...

Generalized Extraction and Classification of Span-Level Clinical Phrases

AMIA ... Annual Symposium proceedings. AMIA Symposium, 2018

Much of the critical information in a patient's electronic health record (EHR) is hidden in unstructured text. As such, there is an increasing role for automated text extraction and summarization to make this information available in a way that can be quickly and easily understood. While many clinical note text extraction techniques have been examined, most existing techniques are either narrowly targeted or focus primarily on concept-level extraction, potentially missing important contextual information. In contrast, in this work we examine the extraction of several clinical categories at the phrase level, attempting to provide the necessary context while still keeping the extracted elements concise. To do so, we employ a three-stage pipeline which extracts categorized phrases of interest using clinical concepts as anchor points. Results suggest the proposed method achieves performance comparable to that of individual human annotators.

Visual Dialog for Radiology: Data Curation and First Steps

Recent work in clinical AI has been focusing on solving tasks that involve both image understanding and reading comprehension. In this study, we further pursue this line of research and introduce the first Visual Dialog task in Radiology, which adds complexity to existing tasks. We present our data collection strategy for both silver and gold-standard datasets for chest x-ray images and discuss associated challenges. We evaluate a Stacked Attention Network model, commonly used for Visual Question Answering in the medical domain, and provide baseline results indicating the difficulty of the task.

Semi-supervised identification of SARS-CoV-2 molecular targets

bioRxiv, 2021

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. In this work, we analyzed a corpus of 66,000 SARS-CoV-2 genome sequences. We developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on use of a single reference genome and by overcoming atypical genome traits. Using this method, we identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction compared to proteome references including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools such as Prokka (base) and VAPiD, we yielded a 6.4- and 1.8-fold increase in protein annotations. Our method generated 13,000,000 molecular target sequences, some conserved across time and geography while others represent emerging variants. We observed 3,3...

Analysis and forecasting of global real time RT-PCR primers and probes for SARS-CoV-2

Scientific Reports, 2021

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on nucleic acid sequence homology and is much faster than tissue culture or animal cell models. However, the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. Different primer sequences have been adopted in different geographic regions. As we rely on these existing RT-PCR primers to track and manage the spread of the Coronavirus, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use tod...
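
The idea that target sequences can diverge from a primer as the virus mutates can be illustrated with a simple mismatch count. The sketch below is only an illustration under stated assumptions and is not the analysis used in the study (the abstract is truncated and does not describe the method): the function names are hypothetical, the sequences are made-up toy strings rather than real primers or genomes, and practical concerns such as reverse-complement primers, probes, and ambiguity codes are ignored.

```python
def count_mismatches(primer, target):
    """Count positions where an equal-length primer and target differ."""
    if len(primer) != len(target):
        raise ValueError("primer and target must have the same length")
    return sum(a != b for a, b in zip(primer.upper(), target.upper()))


def best_binding_site(primer, genome):
    """Slide the primer along the genome and return (position, mismatches)
    for the window with the fewest mismatches (first such window on ties)."""
    best = None
    for start in range(len(genome) - len(primer) + 1):
        window = genome[start:start + len(primer)]
        mismatches = count_mismatches(primer, window)
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


# Toy data (not real sequences): a primer, a reference, and a mutated variant.
primer = "ACGTAC"
reference = "TTACGTACGG"
variant = "TTACGAACGG"  # single substitution inside the primer site

reference_hit = best_binding_site(primer, reference)
variant_hit = best_binding_site(primer, variant)
```

Here the reference contains a perfect match at offset 2, while the variant's best site at the same offset carries one mismatch; tracking such counts across time and geography is one simple way to think about primer divergence.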

Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021

The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. With the increasing availability of genomic data, traditional bioinformatic tools require substantial computational time and the creation of ever-larger indices each time a researcher seeks to gain insight from the data. To address these challenges, we pre-computed important relationships between biological entities spanning the Central Dogma of Molecular Biology and captured this information in a relational database. The database can be queried across hundreds of millions of entities and returns results in a fraction of the time required by traditional methods. In this paper, we describe IBM Functional Genomics Platform (formerly known as OMXWare), a comprehensive database relating genotype to phenotype for bacterial life. Continually updated, IBM Functional Genomics Platform today contains data derived from 200,000 curated, self-consistently assembled genomes. The database stores functional data for over 68 million genes, 52 million proteins, and 239 million domains with associated biological activity annotations from Gene Ontology, KEGG, MetaCyc, and Reactome. IBM Functional Genomics Platform maps all of the many-to-many connections among biological entities, including the originating genome, gene, protein, and protein domain. Various microbial studies, from infectious disease to environmental health, can benefit from the rich data and connections. We describe the data selection, the pipeline to create and update the IBM Functional Genomics Platform, and the developer tools (Python SDK and REST APIs) which allow researchers to efficiently study microbial life at scale.
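
The many-to-many mapping between genomes, genes, proteins, and domains in a relational database can be sketched with Python's built-in `sqlite3` module. This is a toy illustration only: the table names, column names, and rows are assumptions made for this example and do not reflect the platform's actual schema, SDK, or REST APIs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genome  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE gene    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE domain  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Junction tables hold the pre-computed many-to-many relationships.
CREATE TABLE genome_gene (
    genome_id INTEGER REFERENCES genome(id),
    gene_id   INTEGER REFERENCES gene(id));
CREATE TABLE gene_protein (
    gene_id    INTEGER REFERENCES gene(id),
    protein_id INTEGER REFERENCES protein(id));
CREATE TABLE protein_domain (
    protein_id INTEGER REFERENCES protein(id),
    domain_id  INTEGER REFERENCES domain(id));
""")

# Hypothetical toy rows: one gene is shared by two genomes.
conn.executemany("INSERT INTO genome VALUES (?, ?)",
                 [(1, "genome_A"), (2, "genome_B"), (3, "genome_C")])
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "gene_1"), (2, "gene_2")])
conn.executemany("INSERT INTO protein VALUES (?, ?)",
                 [(1, "protein_1"), (2, "protein_2")])
conn.executemany("INSERT INTO domain VALUES (?, ?)",
                 [(1, "domain_1"), (2, "domain_2")])
conn.executemany("INSERT INTO genome_gene VALUES (?, ?)", [(1, 1), (2, 1), (3, 2)])
conn.executemany("INSERT INTO gene_protein VALUES (?, ?)", [(1, 1), (2, 2)])
conn.executemany("INSERT INTO protein_domain VALUES (?, ?)", [(1, 1), (2, 2)])


def genomes_with_domain(connection, domain_name):
    """Return the sorted names of genomes linked to a domain via gene and protein."""
    rows = connection.execute("""
        SELECT DISTINCT genome.name
        FROM domain
        JOIN protein_domain ON protein_domain.domain_id = domain.id
        JOIN gene_protein   ON gene_protein.protein_id = protein_domain.protein_id
        JOIN genome_gene    ON genome_gene.gene_id = gene_protein.gene_id
        JOIN genome         ON genome.id = genome_gene.genome_id
        WHERE domain.name = ?
        ORDER BY genome.name
    """, (domain_name,)).fetchall()
    return [name for (name,) in rows]


result = genomes_with_domain(conn, "domain_1")
```

Because the relationships are stored ahead of time in junction tables, a question such as "which genomes carry this domain?" becomes a join over indexed keys rather than a fresh sequence search.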

On the Role of Artificial Intelligence in Medical Imaging of COVID-19

During the COVID-19 pandemic, lung imaging plays a key role in addressing the magnified need for speed, cost, ubiquity and precision in medical care. The rise of artificial intelligence induced a quantum leap in medical imaging: AI has now proven equipollent to healthcare professionals in several diseases and has the potential to save time and cost and to increase coverage. But AI-accelerated medical imaging must still fully demonstrate its ability in remediating diseases such as COVID-19. We identify key use cases of lung imaging for COVID-19, comparing CT, X-Ray and ultrasound imaging from clinical and AI perspectives. We perform a systematic, manual survey of 197 related publications that reveals a disparity in the focus of the AI and clinical communities, caused by data availability and the lack of collaboration, and in modality trends, driven by ubiquity. Last, challenges in AI-acceleration and ways to remediate them are discussed and future research goals are identified.