Trupti Joshi | University of Missouri Columbia

Papers by Trupti Joshi

A multi-omics informatics approach for identifying molecular mechanisms and biomarkers in clinical patients with endometriosis

2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Endometriosis is a complex gynecological disorder. The diagnostic process of endometriosis involves an invasive procedure, thus delaying the diagnosis by about 10 years on average. Both DNA-methylation data and RNA-seq data have the potential to uncover molecular mechanisms of diseases. The objective of this project is to identify diagnostic molecular mechanisms of endometriosis using a multi-omics approach that will lead to a noninvasive diagnostic procedure.

Mutational Forks: Inferring Deregulated Flow of Signal Transduction Based on Patient-Specific Mutations

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

The precise mechanism behind treatment resistance in cancer is still not fully understood. Despite advances in precision oncology, there is a lack of tools that help build a mechanistic picture of treatment resistance in cancer patients. Existing enrichment methods rely heavily on quantitative data and are limited to the analysis of differentially expressed genes, ignoring crucial players that might be involved in this process. In order to tackle treatment resistance, the precise identification of deregulated flow of signal transduction is critical. Here, we introduce a bioinformatics framework that is capable of inferring deregulated flow of signal transduction given evidence-based knowledge about pathway topology and patient-specific mutations. When tested on a case study, our algorithm was able to confirm findings from a biological experiment in which KRAS mutant cells developed treatment resistance to a MEK inhibitor. Our model provides a framework for mechanistic understanding of acquired treatment resistance, thus equipping clinicians with a tool for finding more accurate diagnostic clues in patients with non-trivial disease representations.

Domain-specific Topic Model for Knowledge Discovery in Computational and Data-Intensive Scientific Communities

IEEE Transactions on Knowledge and Data Engineering

Shortening the time to knowledge discovery and adapting prior domain knowledge are challenges for computational and data-intensive communities such as bioinformatics and neuroscience. The challenge for a domain scientist lies in obtaining guidance by querying massive amounts of information from a diverse text corpus comprising a wide-ranging set of topics when investigating new methods, developing new tools, or integrating datasets. In this paper, we propose a novel "domain-specific topic model" (DSTM) to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplary scientific domains. Our DSTM is a generative model that extends the Latent Dirichlet Allocation (LDA) model and uses the Markov chain Monte Carlo (MCMC) algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from bioinformatics and neuroscience domains that include more than 25,000 papers from the last ten years, featuring hundreds of tools and datasets that are commonly used in relevant studies. Evaluation experiments based on generalization and information retrieval metrics show that our model has better performance than the state-of-the-art baseline models for discovering highly specific latent topics within a domain. Lastly, we demonstrate applications that benefit from our DSTM to discover intra-domain, cross-domain and trend knowledge patterns.

Fuzzy-Engineered Multi-Cloud Resource Brokering for Data-intensive Applications

2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

Multi-cloud resource brokering is becoming a critical requirement for applications that require high scale, diversity, and resilience. Applications demand timely selection of distributed data storage and computation platforms that span local private cloud resources as well as resources from multiple cloud service providers (CSPs). The distinct capabilities and policies, as well as performance/cost of the cloud services, are amongst the prime factors for CSP selection. However, application owners who need suitable cyber resources in community/public clouds often have preliminary knowledge of and preferences for certain CSPs. They also lack expert guidance to handle the problem of overwhelming resource choice from CSPs, and optimization to compensate for service dynamics. In this paper, we address this challenge of optimal resource selection while also leveraging users' limited expertise and preferences towards CSPs through multi-level fuzzy logic modeling based on convoluted factors of performance, agility, cost, and security. We evaluate the efficiency of our fuzzy-engineered resource brokering in improving allocation of resources as well as user satisfiability by using case studies and independent validations of the CSP evaluation.

Application of SNPViz v2.0 using next-generation sequencing data sets in the discovery of potential causative mutations in candidate genes associated with phenotypes

International Journal of Data Mining and Bioinformatics

Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (Indels) are the most common biological markers widely spread across all genome chromosomes. Owing to the large amount of SNP and Indel data that has become available during the last ten years, it is a challenge to intuitively integrate, compare, or visualise them and effectively perform analysis across multiple samples simultaneously. Genome-Wide Association Studies (GWAS) is an approach to find genetic variants associated with a trait, but it lacks an efficient way of investigating genomic variant functions. To tackle these issues, we developed SNPViz v2.0, a web-based tool designed to visualise large-scale haplotype blocks with detailed SNPs and Indels grouped by their chromosomal coordinates, along with their overlapping gene models, phenotype to genotype accuracies, Gene Ontology (GO), protein families (Pfam), and their functional effects. SNPViz v2.0 is available in both SoyKB and KBCommons. For soya bean only, the SNPViz v2.

Security-aware Resource Brokering for Bioinformatics Workflows across Federated Multi-cloud Infrastructures

Proceedings of the 21st International Conference on Distributed Computing and Networking, 2020

Data-intensive science applications often use federated multi-cloud infrastructures to support their compute-intensive processing needs. However, lack of knowledge about a) individual domains' security policies, b) how those policies translate to application security assurance, and c) the nature of performance and security trade-offs can cause performance-security conflicts for applications and inefficient resource usage. In this paper, we propose a security-aware resource brokering middleware framework to allocate application resources by satisfying their performance and security requirements. The proposed middleware implements the MCPS (Multi-Cloud Performance and Security) Broker, which uses a common data model to represent applications' performance and security requirements. It performs security-aware global scheduling to choose the optimal cloud domain, and local scheduling to choose the optimal server within the chosen cloud domain. Using real SoyKB application workflows, we implement the proposed MCPS Broker in the GENI Cloud and demonstrate its utility through a NIST-guided risk assessment.

CCS Concepts: Security and privacy → Security requirements; Networks → Cloud computing; Network resources allocation.

A Formative Usability Study to Improve Prescriptive Systems for Bioinformatics Big Data

2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2020

Big data computation tools are vital for researchers and educators from various domains such as plant science, animal science, biomedical science and others. With the growing computational complexity of biological big data, advanced analytic systems, known as prescriptive systems, are being built using machine learning models to intelligently predict optimum computation solutions for users for better data analysis. However, the lack of user-friendly prescriptive systems poses a critical roadblock to informed decision-making by users. In this paper, we detail a formative usability study to address the complexities faced by users while using prescriptive systems. Our usability research approach considers bioinformatics workflows and community cloud resources in the KBCommons framework's science gateway. The results show that recommendations from usability studies performed in iterations during the development of prescriptive systems can improve user experience and user satisfaction, and help novice as well as expert users make well-informed decisions.

Domain-specific Topic Model for Knowledge Discovery through Conversational Agents in Data Intensive Scientific Communities

2018 IEEE International Conference on Big Data (Big Data), 2018

Machine learning techniques underlying Big Data analytics have the potential to benefit data intensive communities in domain sciences such as bioinformatics and neuroscience. Today’s innovative advances in these domain communities are increasingly built upon multi-disciplinary knowledge discovery and cross-domain collaborations. Consequently, shortening the time to knowledge discovery is a challenge when investigating new methods, developing new tools, or integrating datasets. The challenge for a domain scientist particularly lies in obtaining guidance by querying massive amounts of information from a diverse text corpus comprising a wide-ranging set of topics. In this paper, we propose a novel "domain-specific topic model" (DSTM) that can drive conversational agents for users to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplar scientific domains. The goal of DSTM is to perform data mining to obtain meaningful guidance via a chatbot for domain scientists to choose the relevant tools or datasets pertinent to solving a computational and data intensive research problem at hand. Our DSTM is a Bayesian hierarchical model that extends the Latent Dirichlet Allocation (LDA) model and uses a Markov chain Monte Carlo algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from bioinformatics and neuroscience domains that include hundreds of papers from reputed journal archives, along with hundreds of tools and datasets. Through evaluation experiments with a perplexity metric, we show that our model has better generalization performance within a domain for discovering highly specific latent topics.

Community cloud architecture to improve use accessibility with security compliance in health big data applications

Proceedings of the 20th International Conference on Distributed Computing and Networking, 2019

The adoption of big data analytics in healthcare applications is overwhelming not only because of the huge volume of data being analyzed, but also because of the heterogeneity and sensitivity of the data. Effective and efficient analysis and visualization of secure patient health records are needed, e.g., to find new trends in disease management, determine risk factors for diseases, and support personalized medicine. In this paper, we propose a novel community cloud architecture to give clinicians and researchers easy/increased accessibility to data sets from multiple sources, while also ensuring security compliance of data providers is not compromised. Our cloud-based system design configuration with cloudlet principles ensures application performance has high-speed processing, and data analytics is sufficiently scalable while adhering to security standards (e.g., HIPAA, NIST). We show how our community cloud architecture can be implemented, along with best practices, in an ophthalmology case study that includes health big data (i.e., Health Facts database, I2B2, Millennium) hosted in a campus cloud infrastructure featuring virtual desktop thin-clients and relevant Data Classification Levels in storage.

Inductive Inference of Gene Regulatory Network Using Supervised and Semi-supervised Graph Neural Networks

Discovering gene regulatory relationships and reconstructing gene regulatory networks (GRNs) based on gene expression data is a classical, long-standing computational challenge in bioinformatics. Computationally inferring a possible regulatory relationship between two genes can be formulated as a link prediction problem between two nodes in a graph. Graph neural networks (GNNs) provide an opportunity to construct GRNs by integrating topological neighbor propagation through the whole gene network. We propose an end-to-end gene regulatory graph neural network (GRGNN) approach to reconstruct GRNs from scratch utilizing the gene expression data, in both a supervised and a semi-supervised framework. To get better inductive generalization capability, GRN inference is formulated as a graph classification problem, to distinguish whether a subgraph centered at two nodes contains the link between the two nodes. A linked pair between a transcription factor (TF) and a target gene, and their neighb...

Knowledge Base Commons (KBCommons) v1.1: a universal framework for multi-omics data integration and biological discoveries

BMC Genomics, 2019

Background: Knowledge Base Commons (KBCommons) v1.1 is a universal and all-inclusive web-based framework providing generic functionalities for storing, sharing, analyzing, exploring, integrating and visualizing multiple organisms’ genomics and integrative omics data. KBCommons is designed and developed to integrate diverse multi-level omics data and to support biological discoveries for all species via a common platform. Methods: KBCommons has four modules including data storage, data processing, data accessing, and web interface for data management and retrieval. It provides a comprehensive framework for new plant-specific, animal-specific, virus-specific, bacteria-specific or human disease-specific knowledge base (KB) creation, for adding new genome versions and additional multi-omics data to existing KBs, and for exploring existing datasets within current KBs. Results: KBCommons has an array of tools for data visualization and data analytics such as multiple gene/metabolite search, ...

Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data

Frontiers in Genetics, 2019

PGen: large-scale genomic variations analysis workflow and browser in SoyKB

BMC Bioinformatics, 2016

Background: With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. Results: We have developed both a Linux version in GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB) (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 single-nucleotide polymorphisms (SNPs) and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified from this analysis. SNPs identified using PGen from additional soybean resequencing projects adding to 500+ soybean germplasm lines in total have been integrated. These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house.
In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. Conclusion: The PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. The PGen workflow can also be easily customized for analysis of data in other species.

RDF Sketch Maps - Knowledge Complexity Reduction for Precision Medicine Analytics

Biocomputing 2016, 2015

Realization of precision medicine ideas requires significant research effort to be able to spot subtle differences in complex diseases at the molecular level to develop personalized therapies. It is especially important in many cases of highly heterogeneous cancers. Precision diagnostics and therapeutics of such diseases demand interrogation of vast amounts of biological knowledge coupled with novel analytic methodologies. For instance, pathway-based approaches can shed light on the way tumorigenesis takes place in individual patient cases and pinpoint novel drug targets. However, comprehensive analysis of hundreds of pathways and thousands of genes creates a combinatorial explosion that is challenging for medical practitioners to handle at the point of care. Here we extend our previous work on mapping clinical omics data to curated Resource Description Framework (RDF) knowledge bases to derive influence diagrams of interrelationships of biomarker proteins, diseases and signal transduction pathways for personalized theranostics. We present RDF Sketch Maps, a computational method to reduce knowledge complexity for precision medicine analytics. The method of RDF Sketch Maps is inspired by the way a sketch artist conveys only important visual information and discards other unnecessary details. In our case, we compute and retain only so-called RDF Edges: places with highly important diagnostic and therapeutic information. To do this we utilize 35 maps of human signal transduction pathways by transforming 300 KEGG maps into a highly processable RDF knowledge base.
We have demonstrated potential clinical utility of RDF Sketch Maps in hematopoietic cancers, including analysis of pathways associated with Hairy Cell Leukemia (HCL) and Chronic Myeloid Leukemia (CML), where we achieved up to 20-fold reduction in the number of biological entities to be analyzed while retaining the entities most likely to be important. In experiments with pathways associated with HCL, a generated RDF Sketch Map of the top 30% of paths retained important information about signaling cascades leading to activation of proto-oncogene BRAF, which is usually associated with a different cancer, melanoma. Recent reports of successful treatments of HCL patients by the BRAF-targeted drug vemurafenib support the validity of the RDF Sketch Maps findings. We therefore believe that RDF Sketch Maps will be invaluable for hypothesis generation for precision diagnostics and therapeutics as well as drug repurposing studies.

Virtual physical examination (VPE): a multimedia system for education in medicine

International Journal of Functional Informatics and Personalised Medicine, 2014

The virtual physical examination (VPE) platform is a web-based, multimedia system for medical examination education, distributed and supported by Cerner. The system was built using MySQL, Flash and PHP and developed for creating and conducting physical examinations on virtual patient cases in a simulated environment. VPE allows users to perform physical examinations virtually on patients. A user can create 3D avatars of patients, build medical cases, perform diagnosis and attach associated files including audio, video, image, text and interactive assets. VPE allows for controlled sharing of assets as well as full cases, either within the same organisation or publicly. A video demonstrating VPE can be found at http://digbio.missouri.edu/Cerner/Cerner_VPE_Demo.mp4. VPE has broad educational applications ranging from the basic introduction of medicine to high school students to advanced education for nursing and medical students. The solution is used to both teach and reinforce physical exam concepts. VPE is available at https://vpe.cernerlearningmanager.com.

A Linear Programming Framework for Inferring Gene Regulatory Networks by Integrating Heterogeneous Data

SoyMetDB: The soybean metabolome database

2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010

Identification and evaluation of quantitative trait loci underlying resistance to multiple HG types of soybean cyst nematode in soybean PI 437655

Theoretical and Applied Genetics, 2014

SCN resistance in PI 437655, and to evaluate the QTL for their contribution to SCN resistance. Two F6:7 recombinant inbred line populations, derived from cv. Williams 82 × PI 437655 and cv. Hutcheson × PI 437655 crosses, were evaluated for resistance to SCN HG types 1.2.5.7 (PA2), 0 (PA3), 1.3.5.6.7 (PA14), and 1.2.3.4.5.6.7 (LY2). The 1,536 SNP array was used to genotype the mapping populations and construct genetic linkage maps. Two significant QTL were consistently mapped on chromosomes (Chr.) 18 and 20 in these two populations. One QTL on Chr. 18, which corresponds to the known Rhg1 locus, contributed resistance to SCN HG types 1.

ADON: Application-driven Overlay Network-as-a-Service for data-intensive science

2014 IEEE 3rd International Conference on Cloud Networking (CloudNet), 2014

Campuses are increasingly adopting hybrid cloud architectures for supporting data-intensive science applications that require "on-demand" resources, which are not always available locally on-site. Policies at the campus edge for handling multiple such applications competing for remote resources can cause bottlenecks across applications. These bottlenecks can be proactively avoided with pertinent profiling, monitoring and control of application flows using software-defined networking principles. In this paper, we present an "Application-driven Overlay Network-as-a-Service" (ADON) that can manage the hybrid cloud requirements of multiple applications in a scalable and extensible manner using features such as programmable "custom templates" and a "virtual tenant handler". Our solution approach involves scheduling transit selection and traffic engineering at the campus-edge based on real-time policy control that ensures predictable application performance delivery for multi-tenant traffic profiles. We validate our ADON approach with an implementation on a wide-area overlay network testbed across two campuses, and present a workflow that eases the orchestration of network programmability for campus network providers and data-intensive application users. Lastly, we present an emulation study of the ADON effectiveness in handling temporal behavior of multi-tenant traffic burst arrivals using profiles from a diverse set of actual data-intensive applications.

I. INTRODUCTION

Data-intensive applications in research fields such as bioinformatics, climate modeling, particle physics and genomics generate vast amounts of data that need to be processed with real-time analysis.
The general data processing facilities and specialized compute resources do not always reside at the data generation sites on campus, and data is frequently transferred in real-time to geographically distributed sites (e.g., remote instrumentation site, federated data repository, public cloud) over wide-area networks. Moreover, researchers share workflows of their data-intensive applications with remote collaborators for multidisciplinary initiatives on multi-domain physical networks [1]. Current campus network infrastructures place stringent security policies at the edge router/switch and install firewalls to defend the campus local-area network (LAN) from potential cyber attacks. Such defense mechanisms significantly impact research traffic especially in the case of data-intensive science applications whose flows traverse wide-area network (WAN) paths. This has prompted campuses to build Science DMZs (de-militarized zones) [1] with high-speed (1-100 Gbps) programmable networks to provide dedicated network infrastructures for research traffic flows that need to be handled in parallel to the regular enterprise traffic.

A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

Genome Biology, 2008

Background: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated. Results: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%. Conclusion: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.

Research paper thumbnail of A multi-omics informatics approach for identifying molecular mechanisms and biomarkers in clinical patients with endometriosis

2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Endometriosis is a complex gynecological disorder. The diagnostic process of endometriosis involv... more Endometriosis is a complex gynecological disorder. The diagnostic process of endometriosis involves an invasive procedure thus delaying the diagnosis for about 10 years on average. Both DNA-methylation data and RNA-seq data has the potential to uncover molecular mechanisms of diseases. The objective of this project is to identify diagnostic molecular mechanisms of endometriosis using a multi-omics approach that will lead to noninvasive diagnostic procedure.

Research paper thumbnail of Mutational Forks: Inferring Deregulated Flow of Signal Transduction Based on Patient-Specific Mutations

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

The precise mechanism behind treatment resistance in cancer is still not fully understood. Despit... more The precise mechanism behind treatment resistance in cancer is still not fully understood. Despite advances in precision oncology, there is a lack of tools that help to understand a mechanistic picture of treatment resistance in cancer patients. Existing enrichment methods heavily rely on quantitative data and limited to analysis of differentially expressed genes, ignoring crucial players that might be involved in this process. In order to tackle treatment resistance, the precise identification of deregulated flow of signal transduction is critical. Here, we introduce a bioinformatics framework that is capable of inferring deregulated flow of signal transduction given evidence-based knowledge about pathway topology and patient-specific mutations. While testing the proposed pipeline on a case study, our algorithm was able to confirm findings from biological experiment, where KRAS mutant cells developed treatment resistance to MEK inhibitor. Our model provides a framework for mechanistic understanding of acquired treatment resistance, thus, equipped clinicians with tool for searching more accurate diagnostic clues in patients with non-trivial disease representations.

Research paper thumbnail of Domain-specific Topic Model for Knowledge Discovery in Computational and Data-Intensive Scientific Communities

IEEE Transactions on Knowledge and Data Engineering

Shortened time to knowledge discovery and adapting prior domain knowledge is a challenge for comp... more Shortened time to knowledge discovery and adapting prior domain knowledge is a challenge for computational and dataintensive communities such as e.g., bioinformatics and neuroscience. The challenge for a domain scientist lies in the actions to obtain guidance through query of massive information from diverse text corpus comprising of a wide-ranging set of topics when: investigating new methods, developing new tools, or integrating datasets. In this paper, we propose a novel "domain-specific topic model" (DSTM) to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplary scientific domains. Our DSTM is a generative model that extends the Latent Dirichlet Allocation (LDA) model and uses the Markov chain Monte Carlo (MCMC) algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from bioinformatics and neuroscience domains that include more than 25,000 of papers over the last ten years, featuring hundreds of tools and datasets that are commonly used in relevant studies. Evaluation experiments based on generalization and information retrieval metrics show that our model has better performance than the state-of-the-art baseline models for discovering highly-specific latent topics within a domain. Lastly, we demonstrate applications that benefit from our DSTM to discover intra-domain, cross-domain and trend knowledge patterns.

Fuzzy-Engineered Multi-Cloud Resource Brokering for Data-intensive Applications

2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid)

Multi-cloud resource brokering is becoming a critical requirement for applications that require high scale, diversity, and resilience. Applications demand timely selection of distributed data storage and computation platforms that span local private cloud resources as well as resources from multiple cloud service providers (CSPs). The distinct capabilities and policies, as well as the performance/cost of the cloud services, are among the prime factors for CSP selection. However, application owners who need suitable cyber resources in community/public clouds often have preliminary knowledge and preferences of certain CSPs. They also lack expert guidance to handle the problem of overwhelming resource choice from CSPs, and optimization to compensate for service dynamics. In this paper, we address this challenge of optimal resource selection while also leveraging users' limited expertise and preferences towards CSPs through multi-level fuzzy logic modeling based on convoluted factors of performance, agility, cost, and security. We evaluate the efficiency of our fuzzy-engineered resource brokering in improving allocation of resources as well as user satisfiability by using case studies and independent validations of CSP evaluation.

Application of SNPViz v2.0 using next-generation sequencing data sets in the discovery of potential causative mutations in candidate genes associated with phenotypes

International Journal of Data Mining and Bioinformatics

Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (Indels) are the most common biological markers, widely spread across all genome chromosomes. Owing to the large amount of SNP and Indel data that has become available during the last ten years, it is a challenge to intuitively integrate, compare, or visualise them and to effectively perform analysis across multiple samples simultaneously. A Genome-Wide Association Study (GWAS) is an approach to find genetic variants associated with a trait, but it lacks an efficient way of investigating genomic variant functions. To tackle these issues, we developed SNPViz v2.0, a web-based tool designed to visualise large-scale haplotype blocks with detailed SNPs and Indels grouped by their chromosomal coordinates, along with their overlapping gene models, phenotype to genotype accuracies, Gene Ontology (GO), protein families (Pfam), and their functional effects. SNPViz v2.0 is available in both SoyKB and KBCommons. For soya bean only, the SNPViz v2.

Security-aware Resource Brokering for Bioinformatics Workflows across Federated Multi-cloud Infrastructures

Proceedings of the 21st International Conference on Distributed Computing and Networking, 2020

Data-intensive science applications often use federated multi-cloud infrastructures to support their compute-intensive processing needs. However, lack of knowledge about a) individual domains' security policies, b) how they translate to application security assurance, and c) the nature of performance and security trade-offs can cause performance-security conflicts for applications and inefficient resource usage. In this paper, we propose a security-aware resource brokering middleware framework to allocate application resources by satisfying their performance and security requirements. The proposed middleware implements an MCPS (Multi-Cloud Performance and Security) Broker that uses a common data model to represent applications' performance and security requirements. It performs security-aware global scheduling to choose the optimal cloud domain, and local scheduling to choose the optimal server within the chosen cloud domain. Using real SoyKB application workflows, we implement the proposed MCPS Broker in the GENI Cloud and demonstrate its utility through a NIST-guided risk assessment.
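The two-level selection described in this abstract (a global step that picks a cloud domain, then a local step that picks a server inside it) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Domain` and `Server` classes, the ordinal `security_level` scale, the free-CPU tie-break and the best-fit server rule are all hypothetical and are not taken from the MCPS Broker itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Server:
    name: str
    free_cpus: int


@dataclass
class Domain:
    name: str
    security_level: int  # hypothetical ordinal scale; higher means stricter
    servers: List[Server]


def broker(domains: List[Domain], required_security: int,
           required_cpus: int) -> Optional[Tuple[str, str]]:
    """Return (domain, server) for a request, or None if nothing qualifies."""
    # Global stage: keep domains that meet the security requirement
    # and own at least one server with enough free CPUs.
    eligible = [
        d for d in domains
        if d.security_level >= required_security
        and any(s.free_cpus >= required_cpus for s in d.servers)
    ]
    if not eligible:
        return None
    # Assumed tie-break: prefer the domain with the most free CPUs overall.
    domain = max(eligible, key=lambda d: sum(s.free_cpus for s in d.servers))
    # Local stage (assumed best-fit rule): smallest server that still fits.
    fitting = [s for s in domain.servers if s.free_cpus >= required_cpus]
    server = min(fitting, key=lambda s: s.free_cpus)
    return domain.name, server.name


domains = [
    Domain("A", 1, [Server("a1", 64)]),
    Domain("B", 3, [Server("b1", 8), Server("b2", 16)]),
    Domain("C", 3, [Server("c1", 4)]),
]
print(broker(domains, required_security=2, required_cpus=8))
```

In this toy setup, domain A is excluded on security grounds and domain C on capacity, so the request lands on the smallest adequate server in domain B.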

A Formative Usability Study to Improve Prescriptive Systems for Bioinformatics Big Data

2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2020

Big data computation tools are vital for researchers and educators from various domains such as plant science, animal science, biomedical science and others. With the growing computational complexity of biological big data, advanced analytic systems, known as prescriptive systems, are being built using machine learning models to intelligently predict optimum computation solutions that help users analyze data better. However, the lack of user-friendly prescriptive systems poses a critical roadblock to informed decision-making by users. In this paper, we detail a formative usability study to address the complexities faced by users of prescriptive systems. Our usability research approach considers bioinformatics workflows and community cloud resources in the KBCommons framework's science gateway. The results show that recommendations from usability studies performed in iterations during the development of prescriptive systems can improve user experience and user satisfaction, and help novice as well as expert users make well-informed decisions.

Domain-specific Topic Model for Knowledge Discovery through Conversational Agents in Data Intensive Scientific Communities

2018 IEEE International Conference on Big Data (Big Data), 2018

Machine learning techniques underlying Big Data analytics have the potential to benefit data-intensive communities in domain sciences such as bioinformatics and neuroscience. Today's innovative advances in these domain communities are increasingly built upon multi-disciplinary knowledge discovery and cross-domain collaborations. Consequently, shortening the time to knowledge discovery is a challenge when investigating new methods, developing new tools, or integrating datasets. The challenge for a domain scientist lies particularly in obtaining guidance by querying massive amounts of information from a diverse text corpus comprising a wide-ranging set of topics. In this paper, we propose a novel "domain-specific topic model" (DSTM) that can drive conversational agents for users to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplar scientific domains. The goal of DSTM is to perform data mining to obtain meaningful guidance via a chatbot for domain scientists to choose the relevant tools or datasets pertinent to solving a computational and data-intensive research problem at hand. Our DSTM is a Bayesian hierarchical model that extends the Latent Dirichlet Allocation (LDA) model and uses a Markov chain Monte Carlo algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from the bioinformatics and neuroscience domains that include hundreds of papers from reputed journal archives, along with hundreds of tools and datasets. Through evaluation experiments with a perplexity metric, we show that our model has better generalization performance within a domain for discovering highly specific latent topics.

Community cloud architecture to improve use accessibility with security compliance in health big data applications

Proceedings of the 20th International Conference on Distributed Computing and Networking, 2019

The adoption of big data analytics in healthcare applications is overwhelming not only because of the huge volume of data being analyzed, but also because of the heterogeneity and sensitivity of the data. Effective and efficient analysis and visualization of secure patient health records are needed to, e.g., find new trends in disease management, determine risk factors for diseases, and support personalized medicine. In this paper, we propose a novel community cloud architecture to give clinicians and researchers easier and increased access to data sets from multiple sources, while also ensuring that the security compliance of data providers is not compromised. Our cloud-based system design configuration with cloudlet principles ensures that application performance benefits from high-speed processing, and that data analytics is sufficiently scalable while adhering to security standards (e.g., HIPAA, NIST). Through an ophthalmology case study, we show how our community cloud architecture can be implemented along with best practices; the case study includes health big data (i.e., Health Facts database, I2B2, Millennium) hosted in a campus cloud infrastructure featuring virtual desktop thin-clients and relevant Data Classification Levels in storage.

Inductive Inference of Gene Regulatory Network Using Supervised and Semi-supervised Graph Neural Networks

Discovering gene regulatory relationships and reconstructing gene regulatory networks (GRN) based on gene expression data is a classical, long-standing computational challenge in bioinformatics. Computationally inferring a possible regulatory relationship between two genes can be formulated as a link prediction problem between two nodes in a graph. Graph neural network (GNN) provides an opportunity to construct GRN by integrating topological neighbor propagation through the whole gene network. We propose an end-to-end gene regulatory graph neural network (GRGNN) approach to reconstruct GRNs from scratch utilizing the gene expression data, in both a supervised and a semi-supervised framework. To get better inductive generalization capability, GRN inference is formulated as a graph classification problem, to distinguish whether a subgraph centered at two nodes contains the link between the two nodes. A linked pair between a transcription factor (TF) and a target gene, and their neighb...

Knowledge Base Commons (KBCommons) v1.1: a universal framework for multi-omics data integration and biological discoveries

BMC Genomics, 2019

Background: Knowledge Base Commons (KBCommons) v1.1 is a universal and all-inclusive web-based framework providing generic functionalities for storing, sharing, analyzing, exploring, integrating and visualizing multiple organisms' genomics and integrative omics data. KBCommons is designed and developed to integrate diverse multi-level omics data and to support biological discoveries for all species via a common platform. Methods: KBCommons has four modules including data storage, data processing, data accessing, and web interface for data management and retrieval. It provides a comprehensive framework for new plant-specific, animal-specific, virus-specific, bacteria-specific or human disease-specific knowledge base (KB) creation, for adding new genome versions and additional multi-omics data to existing KBs, and for exploring existing datasets within current KBs. Results: KBCommons has an array of tools for data visualization and data analytics such as multiple gene/metabolite search, ...

Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data

Frontiers in Genetics, 2019

PGen: large-scale genomic variations analysis workflow and browser in SoyKB

BMC Bioinformatics, 2016

Background: With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and the Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. Results: We have developed both a Linux version in GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB) (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 SNPs and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. This analysis also identified 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions. SNPs identified using PGen from additional soybean resequencing projects, adding up to 500+ soybean germplasm lines in total, have been integrated. These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house. In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. Conclusion: The PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. The PGen workflow can also be easily customized for analysis of data in other species.

RDF Sketch Maps - Knowledge Complexity Reduction for Precision Medicine Analytics

Biocomputing 2016, 2015

Realization of precision medicine ideas requires significant research effort to be able to spot subtle differences in complex diseases at the molecular level to develop personalized therapies. It is especially important in many cases of highly heterogeneous cancers. Precision diagnostics and therapeutics of such diseases demand interrogation of vast amounts of biological knowledge coupled with novel analytic methodologies. For instance, pathway-based approaches can shed light on the way tumorigenesis takes place in individual patient cases and pinpoint novel drug targets. However, comprehensive analysis of hundreds of pathways and thousands of genes creates a combinatorial explosion that is challenging for medical practitioners to handle at the point of care. Here we extend our previous work on mapping clinical omics data to curated Resource Description Framework (RDF) knowledge bases to derive influence diagrams of interrelationships of biomarker proteins, diseases and signal transduction pathways for personalized theranostics. We present RDF Sketch Maps, a computational method to reduce knowledge complexity for precision medicine analytics. The method of RDF Sketch Maps is inspired by the way a sketch artist conveys only important visual information and discards other unnecessary details. In our case, we compute and retain only so-called RDF Edges: places with highly important diagnostic and therapeutic information. To do this we utilize 35 maps of human signal transduction pathways by transforming 300 KEGG maps into a highly processable RDF knowledge base. We have demonstrated the potential clinical utility of RDF Sketch Maps in hematopoietic cancers, including analysis of pathways associated with Hairy Cell Leukemia (HCL) and Chronic Myeloid Leukemia (CML), where we achieved up to 20-fold reduction in the number of biological entities to be analyzed, while retaining the most likely important entities. In experiments with pathways associated with HCL, a generated RDF Sketch Map of the top 30% paths retained important information about signaling cascades leading to activation of the proto-oncogene BRAF, which is usually associated with a different cancer, melanoma. Recent reports of successful treatments of HCL patients with the BRAF-targeted drug vemurafenib support the validity of the RDF Sketch Maps findings. We therefore believe that RDF Sketch Maps will be invaluable for hypothesis generation for precision diagnostics and therapeutics as well as drug repurposing studies.

Virtual physical examination (VPE): a multimedia system for education in medicine

International Journal of Functional Informatics and Personalised Medicine, 2014

The virtual physical examination (VPE) platform is a web-based, multimedia system for medical examination education, distributed and supported by Cerner. The system was built using MySQL, Flash and PHP and developed for creating and conducting physical examinations on virtual patient cases in a simulated environment. VPE allows users to perform physical examinations virtually on patients. A user can create 3D avatars of patients, build medical cases, perform diagnosis and attach associated files including audio, video, image, text and interactive assets. VPE allows for controlled sharing of assets as well as full cases, either within the same organisation or publicly. A video demonstrating VPE can be found at http://digbio.missouri.edu/Cerner/Cerner_VPE_Demo.mp4. VPE has broad educational applications ranging from the basic introduction of medicine to high school students to advanced education for nursing and medical students. The solution is used to both teach and reinforce physical exam concepts. VPE is available at https://vpe.cernerlearningmanager.com.

A Linear Programming Framework for Inferring Gene Regulatory Networks by Integrating Heterogeneous Data

SoyMetDB: The soybean metabolome database

2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010

Identification and evaluation of quantitative trait loci underlying resistance to multiple HG types of soybean cyst nematode in soybean PI 437655

Theoretical and Applied Genetics, 2014

SCN resistance in PI 437655, and to evaluate the QTL for their contribution to SCN resistance. Two F6:7 recombinant inbred line populations, derived from cv. Williams 82 × PI 437655 and cv. Hutcheson × PI 437655 crosses, were evaluated for resistance to SCN HG types 1.2.5.7 (PA2), 0 (PA3), 1.3.5.6.7 (PA14), and 1.2.3.4.5.6.7 (LY2). The 1,536 SNP array was used to genotype the mapping populations and construct genetic linkage maps. Two significant QTL were consistently mapped on chromosomes (Chr.) 18 and 20 in these two populations. One QTL on Chr. 18, which corresponds to the known Rhg1 locus, contributed resistance to SCN HG types 1.

ADON: Application-driven Overlay Network-as-a-Service for data-intensive science

2014 IEEE 3rd International Conference on Cloud Networking (CloudNet), 2014

Campuses are increasingly adopting hybrid cloud architectures for supporting data-intensive science applications that require "on-demand" resources, which are not always available locally on-site. Policies at the campus edge for handling multiple such applications competing for remote resources can cause bottlenecks across applications. These bottlenecks can be proactively avoided with pertinent profiling, monitoring and control of application flows using software-defined networking principles. In this paper, we present an "Application-driven Overlay Network-as-a-Service" (ADON) that can manage the hybrid cloud requirements of multiple applications in a scalable and extensible manner using features such as: programmable "custom templates" and a "virtual tenant handler". Our solution approach involves scheduling transit selection and traffic engineering at the campus-edge based on real-time policy control that ensures predictable application performance delivery for multi-tenant traffic profiles. We validate our ADON approach with an implementation on a wide-area overlay network testbed across two campuses, and present a workflow that eases the orchestration of network programmability for campus network providers and data-intensive application users. Lastly, we present an emulation study of the ADON effectiveness in handling temporal behavior of multi-tenant traffic burst arrivals using profiles from a diverse set of actual data-intensive applications.
I. INTRODUCTION Data-intensive applications in research fields such as bioinformatics, climate modeling, particle physics and genomics generate vast amounts of data that need to be processed with real-time analysis. The general data processing facilities and specialized compute resources do not always reside at the data generation sites on campus, and data is frequently transferred in real-time to geographically distributed sites (e.g., remote instrumentation site, federated data repository, public cloud) over wide-area networks. Moreover, researchers share workflows of their data-intensive applications with remote collaborators for multidisciplinary initiatives on multi-domain physical networks [1]. Current campus network infrastructures place stringent security policies at the edge router/switch and install firewalls to defend the campus local-area network (LAN) from potential cyber attacks. Such defense mechanisms significantly impact research traffic especially in the case of data-intensive science applications whose flows traverse wide-area network (WAN) paths. This has prompted campuses to build Science DMZs (de-militarized zones) [1] with high-speed (1-100 Gbps) programmable networks to provide dedicated network infrastructures for research traffic flows that need to be handled in parallel to the regular enterprise traffic.

A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

Genome Biology, 2008

Background: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated. Results: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%. Conclusion: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.
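The "precision at a recall rate of 20%" measure quoted in this abstract can be illustrated with a small sketch. It assumes a list of scored predictions for a single GO term with binary ground-truth labels; the function name `precision_at_recall` and the toy scores are hypothetical and are not taken from the study's evaluation pipeline.

```python
from typing import Sequence


def precision_at_recall(scores: Sequence[float], labels: Sequence[int],
                        target_recall: float) -> float:
    """Precision at the shortest score-ranked cutoff whose recall reaches the target."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k  # fraction of the top-k calls that are correct
    return true_pos / len(ranked)


# Toy example: 10 genes scored for one term, 5 of them truly annotated.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
print(precision_at_recall(scores, labels, target_recall=0.2))
```

Averaging this value over many GO terms would give a summary comparable in spirit to the figure reported above, though the exact averaging used in the study is not stated here.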