
Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities

Marinka Zitnik et al. Inf Fusion. 2019 Oct.

Abstract

New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.

Keywords: computational biology; heterogeneous data; machine learning; personalized medicine; systems biology.


Figures

Figure 1: The importance of data integration in biomedicine.

Considering variation in only a single data type can miss many important patterns that can only be observed by considering multiple levels of biomedical data. Shown is a hypothetical example using disease diagnostics as a point of interest. When a new patient arrives at the clinic, (a) domain experts sequence the patient’s genome and compare it with a database to identify mutations and disease-causing genes, (b) perform laboratory tests using tissue samples, and (c) process information about the patient’s behavior and lifestyle. (d) The patient’s genomic, transcriptomic, and lifestyle information is combined with curated databases of biomedical knowledge (e.g., disease and metabolic pathways). Finally, a machine learning algorithm predicts the probability that the patient will develop a particular disease in the near future. To make an accurate prediction, the machine learning model must draw on many different types of data; no single type of patient data suffices.

Figure 2: Categorization of approaches for data integration.

(a) Examples of multi-omics data about patients. (b-d) Data integration approaches can be divided into three categories. (b) _Early integration approaches_ combine datasets from different data types at the raw or processed level before analysis and prediction. (c) _Intermediate integration approaches_ transform or map the underlying datasets at the same time as they estimate model parameters. (d) _Late integration approaches_ perform analysis on each dataset independently, which is followed by integration of the resulting models to generate predictions, e.g., a prognosis for a particular patient. SNP, single-nucleotide polymorphism.
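The contrast between early and late integration can be illustrated with a minimal sketch. The data, modality names, and the trivial mean-threshold "classifier" below are all hypothetical placeholders, not from the Review; real pipelines would use proper statistical models.

```python
# Hypothetical toy example contrasting early vs. late integration.

def fit_threshold(features, labels):
    """Learn a per-feature threshold: the mean of the positive examples."""
    pos = [f for f, y in zip(features, labels) if y == 1]
    n = len(pos[0])
    return [sum(row[i] for row in pos) / len(pos) for i in range(n)]

def predict(thresholds, x):
    """Predict 1 if at least half of the features reach their thresholds."""
    votes = sum(1 for t, v in zip(thresholds, x) if v >= t)
    return 1 if votes * 2 >= len(thresholds) else 0

# Two hypothetical modalities measured on the same four patients.
genomics  = [[1.0, 0.9], [0.2, 0.1], [0.8, 1.1], [0.1, 0.3]]
lifestyle = [[0.7], [0.2], [0.9], [0.1]]
labels    = [1, 0, 1, 0]

# Early integration: concatenate modalities, then fit a single model.
early_model = fit_threshold([g + l for g, l in zip(genomics, lifestyle)],
                            labels)
early_pred = predict(early_model, [0.9, 1.0, 0.8])

# Late integration: fit one model per modality, then combine predictions.
m_gen = fit_threshold(genomics, labels)
m_life = fit_threshold(lifestyle, labels)
late_votes = predict(m_gen, [0.9, 1.0]) + predict(m_life, [0.8])
late_pred = 1 if late_votes >= 1 else 0
```

Intermediate integration, by contrast, would couple the modalities inside a single model while it is being fit, rather than before (early) or after (late) fitting.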

Figure 3: Data integration.

Data integration approaches combine multiple sources of information in a statistically meaningful way to provide a comprehensive analysis of a biomedical point of interest. Broadly, existing approaches employ three distinct modeling strategies (i.e., early, intermediate, and late integration; see also Figure 2) and produce three types of prediction outputs (i.e., a label representing the probability of an entity belonging to a given class; a relationship representing the probability of an association between two entities; and a complex structure, such as an inferred network or a partitioning of entities into groups).

Figure 4: Scheme of single-cell multi-omics data integration.

A generic bioinformatic analysis workflow usually includes three steps: first, the raw data are preprocessed, filtered, and quality-controlled separately for each assayed omics dimension, accounting for the analytical challenges of single-cell data, such as technical variation, sparse signal, and amplified artifacts. Second, as single-cell data are intrinsically of low coverage, it is good practice to increase the signal-to-noise ratio by aggregating data; for example, by combining expression levels of genes of similar function or similar DNA methylation levels across genomic regions bound by the same transcription factors. Finally, data are integrated into one multi-omics map, representing a data-driven single-cell model.
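The aggregation step can be sketched as follows: sparse per-gene counts are replaced by averages over groups of genes with similar function. The cells, genes, and functional modules below are hypothetical placeholders for illustration only.

```python
# Sketch of aggregating sparse single-cell signal over functional modules.

# Sparse single-cell expression: one dict per cell, gene -> count.
cells = [
    {"geneA": 2, "geneC": 1},
    {"geneB": 3},
    {"geneA": 1, "geneB": 1, "geneD": 4},
]

# Hypothetical modules grouping genes of similar function.
modules = {"stress_response": ["geneA", "geneB"],
           "cell_cycle": ["geneC", "geneD"]}

def aggregate(cell, modules):
    """Replace noisy per-gene counts with per-module averages."""
    return {m: sum(cell.get(g, 0) for g in genes) / len(genes)
            for m, genes in modules.items()}

# Each cell's profile is now a denser, lower-dimensional module vector.
profiles = [aggregate(c, modules) for c in cells]
```

Averaging over a module trades per-gene resolution for a more reliable module-level signal, which is the point of the aggregation step in low-coverage data.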

Figure 5: A matrix-based representation of diverse datasets relevant for gene function prediction.

Let us consider a hypothetical gene function prediction task. Here, the function is response to bacterial infection [187], meaning that the task is to identify genes in a eukaryotic organism that determine how the organism will respond to a bacterial infection. A variety of diverse datasets are potentially relevant for this task, and each dataset is typically represented as a separate data matrix. Shown is an example with six data matrices, including gene-phenotype associations, gene expression profiles, biomedical literature, and annotations of research papers. Integrative approaches solve the gene function prediction task by establishing a rigorous statistical correspondence between different input dimensions of these seemingly disparate data matrices [48, 43, 188, 189, 190, 191, 33, 27, 192, 193]. For example, genes can be linked to concepts in the Medical Subject Headings (MeSH) database via gene-publication relationships (i.e., lists of genes discussed in a given research paper), followed by publication-MeSH relationships (i.e., lists of the MeSH concepts assigned to a given research paper). A collective matrix factorization approach [33] can fuse such complex systems of data matrices. The approach has been used to predict gene functions in various species [33, 190] and has subsequently been applied to the prioritization of genes mediating bacterial infections [194].
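The chaining of gene-publication and publication-MeSH relationships amounts to a matrix product. A minimal sketch with tiny hypothetical binary matrices (the gene, publication, and concept identifiers are invented):

```python
# Linking genes to MeSH concepts by multiplying two relation matrices.

genes = ["g1", "g2"]
pubs = ["p1", "p2", "p3"]
mesh = ["infection", "immunity"]

# gene_pub[i][j] = 1 if gene i is discussed in publication j.
gene_pub = [[1, 1, 0],
            [0, 1, 1]]
# pub_mesh[j][k] = 1 if MeSH concept k is assigned to publication j.
pub_mesh = [[1, 0],
            [1, 1],
            [0, 1]]

def matmul(A, B):
    """Plain matrix multiplication over nested lists."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

# gene_mesh[i][k] counts publications linking gene i to MeSH concept k.
gene_mesh = matmul(gene_pub, pub_mesh)
```

Collective matrix factorization generalizes this idea: instead of multiplying relations directly, it factorizes all the data matrices jointly over shared latent dimensions.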

Figure 6: Gene prioritization.

Gene prioritization aims to identify the most promising genes among a list of candidate genes with respect to a biological process of interest. The biological process of interest is most often represented by a small set of seed genes that are known to be involved in the process. Typically, traditional disease gene hunting techniques generate lists of dozens or hundreds of candidate genes, among which only one or a few are of primary interest. The overall goal is to identify these genes and, in a second step, to experimentally validate only these genes. Many computational methods that use different algorithms, datasets, and strategies have been developed [224, 227, 228, 222, 229, 194, 230, 231, 232]. Some of these approaches have been implemented as publicly available tools, and several have been experimentally validated [224, 228, 229, 194, 225].
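One simple prioritization strategy, shown here as a hedged sketch, is to rank candidates by how many interaction partners they share with the seed genes. The network, gene names, and scoring rule below are hypothetical; published tools use far richer data and statistics.

```python
# Toy gene prioritization: rank candidates by shared partners with seeds.

# Hypothetical interaction partners per gene.
partners = {
    "seed1": {"a", "b", "c"},
    "seed2": {"b", "d"},
    "cand1": {"a", "b", "d"},
    "cand2": {"e", "f"},
    "cand3": {"c", "d"},
}
seeds = ["seed1", "seed2"]
candidates = ["cand1", "cand2", "cand3"]

def score(gene):
    """Count interaction partners shared with any seed gene."""
    seed_partners = set().union(*(partners[s] for s in seeds))
    return len(partners[gene] & seed_partners)

# Highest-scoring candidates are prioritized for experimental validation.
ranked = sorted(candidates, key=score, reverse=True)
```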

Figure 7: A network-based approach to cellular function prediction.

Biological networks are a powerful representation for the discovery of interactions and emergent properties in biological systems, ranging from cell type identification at a single-cell level to disease treatment at a patient level. Fundamental to biological networks is the principle that genes involved in the same cellular function or underlying the same phenotype tend to interact [49]. This principle has been used many times to combine and to amplify signals from individual genes, and has led to remarkable discoveries in biology. For example, network-based methods for protein function prediction [247, 248, 23, 249] often use a heterogeneous protein-protein interaction network and conduct a large number of random walks on the network that are biased towards visiting known proteins associated with a specific function. These methods then calculate a score for each protein representing the probability that the protein is involved in a given cellular function, based on how often the protein’s node in the network is visited by random walkers.
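The biased random walk described above is commonly computed as a random walk with restart, iterated to its stationary distribution rather than simulated. A minimal sketch on a hypothetical four-protein network (the edges, annotations, and restart probability are invented for illustration):

```python
# Random walk with restart (RWR) on a tiny hypothetical protein network.

edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p3")]
known = {"p1"}  # proteins already annotated with the function of interest
alpha = 0.3     # restart probability (a hypothetical choice)

nodes = sorted({n for e in edges for n in e})
neigh = {n: [] for n in nodes}
for a, b in edges:
    neigh[a].append(b)
    neigh[b].append(a)

# Power iteration of p <- (1 - alpha) * W p + alpha * restart,
# where W spreads each node's mass evenly over its neighbors.
p = {n: (1.0 / len(known) if n in known else 0.0) for n in nodes}
for _ in range(100):
    new = {}
    for n in nodes:
        inflow = sum(p[m] / len(neigh[m]) for m in neigh[n])
        new[n] = (1 - alpha) * inflow + (alpha if n in known else 0.0)
    p = new

# Higher p[n] means walkers visit n more often, i.e., a stronger candidate.
```

The restart term keeps walkers near the annotated seed proteins, so the stationary visit probabilities concentrate on the seeds' network neighborhood.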

Figure 8: Drug-target and drug-drug interactions.

A heterogeneous network representation of drugs and the proteins targeted by the drugs. In addition to interaction information, e.g., drug-drug interactions, drug-protein interactions, and protein-protein interactions (Section 8), each node in the network has a feature vector describing important biological characteristics of the node, e.g., the drug’s chemical structure and the protein’s activity in tissues. Such networks are used to address two important tasks in computational pharmacology. The first is the prediction of drug-target interactions [260, 264, 19, 265], which are fundamental to the way that drugs work and often provide an important foundation for other tasks in computational pharmacology. The second is the prediction of drug-drug interactions [273, 274, 270, 275], which are fundamental to modeling drug combinations and identifying drug pairs whose combination gives an exaggerated response beyond the response expected under no interaction. Zitnik et al. [45] use heterogeneous networks, such as the one shown in the figure, and develop a graph convolutional deep network approach to predict which side effects a patient might develop when taking multiple drugs at the same time.
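A much simpler baseline for scoring candidate drug-drug pairs on such a network is to compare the targets the two drugs share. The drugs, targets, and scoring rule below are hypothetical, and this sketch is far simpler than the graph convolutional approach cited in the caption:

```python
# Toy link-prediction baseline: score a drug pair by shared targets.

# Hypothetical drug -> protein-target sets from the heterogeneous network.
drug_targets = {
    "drugA": {"prot1", "prot2"},
    "drugB": {"prot2", "prot3"},
    "drugC": {"prot4"},
}

def jaccard(a, b):
    """Jaccard similarity of two target sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def interaction_score(d1, d2):
    """Drugs with overlapping targets are more likely to interact."""
    return jaccard(drug_targets[d1], drug_targets[d2])

score_ab = interaction_score("drugA", "drugB")  # share prot2
score_ac = interaction_score("drugA", "drugC")  # no shared targets
```

Graph convolutional methods improve on this by propagating node feature vectors over the full heterogeneous network instead of looking only at direct target overlap.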

Figure 9: Drug repurposing.

One exciting application of computational pharmacology is drug repurposing [252, 20]. Drug repurposing uses computational methods to find new uses for existing drugs. Given a disease, the task is to predict drugs (e.g., among all drugs approved for use by the U.S. Food and Drug Administration) that might treat that disease. Integrative methods for drug repurposing comprise similarity-based methods [317], network modeling [260, 322, 272], and matrix factorization [324].
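The similarity-based family of methods can be sketched with a toy guilt-by-association heuristic: a drug is scored for a disease by its similarity to drugs already known to treat that disease. The fingerprints, drug names, and indication table below are hypothetical.

```python
# Toy similarity-based repurposing: score drugs by similarity to
# drugs with a known indication for the disease.

# Hypothetical binary chemical fingerprints (sets of on-bits).
fingerprints = {
    "drugX": {1, 2, 3},
    "drugY": {2, 3, 4},
    "drugZ": {7, 8},
}
treats = {"diseaseD": ["drugX"]}  # known indications

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints."""
    return len(a & b) / len(a | b) if a | b else 0.0

def repurpose_score(drug, disease):
    """Best similarity of `drug` to any drug known to treat `disease`."""
    return max((tanimoto(fingerprints[drug], fingerprints[d])
                for d in treats[disease]), default=0.0)

# drugY resembles drugX, so it scores higher for diseaseD than drugZ.
s_y = repurpose_score("drugY", "diseaseD")
s_z = repurpose_score("drugZ", "diseaseD")
```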

Figure 10: Disease subtyping.

Many diseases are heterogeneous. _Disease subtyping_ stratifies a heterogeneous group of patients with a particular disease into homogeneous subgroups, i.e., subtypes, based on clinical, molecular, and other types of patient features. Accurate clustering of patients into subtypes is an important step towards personalized medicine and can inform clinical decision making and treatment matching.
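As a minimal illustration of subtyping-as-clustering, a tiny k-means over hypothetical patient feature vectors (the data and the deterministic initialization from the first k patients are invented for this sketch; real subtyping methods integrate multiple data types and use more robust clustering):

```python
# Toy k-means clustering of hypothetical patient feature vectors.

patients = [[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]]
k = 2

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Deterministic init: first k patients serve as initial centroids.
centroids = [list(p) for p in patients[:k]]
for _ in range(10):
    # Assign each patient to the nearest centroid (its subtype).
    assign = [min(range(k), key=lambda c: dist2(p, centroids[c]))
              for p in patients]
    # Move each centroid to the mean of its assigned patients.
    for c in range(k):
        members = [p for p, a in zip(patients, assign) if a == c]
        if members:
            centroids[c] = [sum(col) / len(members)
                            for col in zip(*members)]

# `assign` now partitions patients into k hypothetical subtypes.
```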



References

    1. The ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome, Nature 489 (7414) (2012) 57. - PMC - PubMed
    2. Kundaje A, et al., Integrative analysis of 111 reference human epigenomes, Nature 518 (7539) (2015) 317–330. - PMC - PubMed
    3. Quake SR, Wyss-Coray T, Darmanis S, The Tabula Muris Consortium, et al., Single-cell transcriptomic characterization of 20 organs and tissues from individual mice creates a Tabula Muris, bioRxiv (2018) 237446.
    4. Wilhelm M, Schlegl J, Hahne H, Gholami AM, Lieberenz M, Savitski MM, Ziegler E, Butzmann L, Gessulat S, Marx H, et al., Mass-spectrometry-based draft of the human proteome, Nature 509 (7502) (2014) 582. - PubMed
    5. Costanzo M, VanderSluis B, Koch EN, Baryshnikova A, Pons C, Tan G, Wang W, Usaj M, Hanchard J, Lee SD, et al., A global genetic interaction network maps a wiring diagram of cellular function, Science 353 (6306) (2016) aaf1420. - PMC - PubMed
