Inferring correlation networks from genomic survey data

Jonathan Friedman et al. PLoS Comput Biol. 2012.


High-throughput sequencing-based techniques, such as 16S rRNA gene profiling, have the potential to elucidate the complex inner workings of natural microbial communities - be they from the world's oceans or the human gut. A key step in exploring such data is the identification of dependencies between members of these communities, which is commonly achieved by correlation analysis. However, it has been known since the days of Karl Pearson that the analysis of the type of data generated by such techniques (referred to as compositional data) can produce unreliable results since the observed data take the form of relative fractions of genes or species, rather than their absolute abundances. Using simulated and real data from the Human Microbiome Project, we show that such compositional effects can be widespread and severe: in some real data sets many of the correlations among taxa can be artifactual, and true correlations may even appear with opposite sign. Additionally, we show that community diversity is the key factor that modulates the acuteness of such compositional effects, and develop a new approach, called SparCC, which is capable of estimating correlation values from compositional data. To illustrate a potential application of SparCC, we infer a rich ecological network connecting hundreds of interacting species across 18 sites on the human body. Using the SparCC network as a reference, we estimated that the standard approach yields 3 spurious species-species interactions for each true interaction and misses 60% of the true interactions in the human microbiome data, and, as predicted, most of the erroneous links are found in the samples with the lowest diversity.
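The compositional effect described above can be reproduced with a small simulation. The following is a minimal sketch, not the authors' code: it assumes independent log-normal absolute abundances for a hypothetical three-taxon community (the distribution, sample size, and the names `pearson` and `simulate` are illustrative choices), converts each sample to relative fractions, and compares Pearson correlations before and after normalization.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_samples=2000, n_taxa=3, seed=0):
    """Draw independent absolute abundances, then normalize each sample.

    The log-normal distribution is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    absolute = [
        [rng.lognormvariate(0.0, 1.0) for _ in range(n_taxa)]
        for _ in range(n_samples)
    ]
    # Relative fractions: each sample (row) is divided by its own total.
    fractions = [[value / sum(row) for value in row] for row in absolute]
    return absolute, fractions


def column(data, j):
    """Extract taxon j across all samples."""
    return [row[j] for row in data]


absolute, fractions = simulate()
r_abs = pearson(column(absolute, 0), column(absolute, 1))
r_frac = pearson(column(fractions, 0), column(fractions, 1))
print(f"absolute abundances: r = {r_abs:+.2f}")
print(f"relative fractions:  r = {r_frac:+.2f}")
```

In this toy setting the taxa are independent by construction, so the correlation of the absolute abundances should be close to zero, while the fractions are pushed toward a negative correlation simply because they must sum to one. With few, exchangeable taxa the expected correlation between fractions is about -1/(n_taxa - 1), which is one way to see why low-diversity communities are affected most.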

The authors have declared that no competing interests exist.


Figure 1. Similar correlation networks are observed for real-world vs. randomly shuffled bacterial abundance data.

Correlation networks based on 16S rRNA gene survey data collected as part of the Human Microbiome Project (HMP), inferred using Pearson correlations (left column), and SparCC (right column). Additionally, Pearson correlation networks were inferred from shuffled HMP data (middle column), where all OTUs are independent. The Pearson networks inferred from shuffled data show patterns similar to the ones seen in the Pearson networks of the real data, especially for low diversity body sites. This indicates that the observed Pearson network structure may be due to biases inherent in compositional data rather than a real biological signal. In contrast, no significant correlations were inferred from the shuffled data using SparCC (data not shown). Nodes represent OTUs, with size reflecting the OTU's average fraction in the community. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn. See Fig. S1 for all 18 HMP body sites.
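One way to build a shuffled control of the kind described in this caption is to permute each OTU's counts across samples independently, which removes any dependence between OTUs while preserving each OTU's own distribution. The sketch below assumes that scheme; the exact shuffling procedure used for the figure is not given here, and the function name `shuffle_otus` and the samples-by-OTUs list layout are hypothetical.

```python
import random


def shuffle_otus(counts, seed=0):
    """Independently permute each OTU's counts across samples.

    counts: list of samples, each a list with one count per OTU.
    Returns a new table of the same shape; the input is left untouched.
    This per-OTU permutation is an assumed scheme, used for illustration.
    """
    rng = random.Random(seed)
    n_otus = len(counts[0])
    shuffled_columns = []
    for j in range(n_otus):
        col = [row[j] for row in counts]  # counts of OTU j in every sample
        rng.shuffle(col)                  # break its link to other OTUs
        shuffled_columns.append(col)
    # Transpose back to the samples-by-OTUs layout.
    return [list(row) for row in zip(*shuffled_columns)]


# Tiny made-up count table: 4 samples, 3 OTUs.
table = [
    [10, 0, 5],
    [20, 1, 4],
    [30, 2, 3],
    [40, 3, 2],
]
shuffled = shuffle_otus(table, seed=1)
```

Any correlation network inferred from the shuffled table after conversion to fractions reflects the normalization rather than an interaction between OTUs, which is the comparison the middle column of the figure makes.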

Figure 2. Pearson correlation inference quality deteriorates with decreasing diversity.

Basis data was simulated with a known correlation structure. OTU counts were generated by randomly drawing from the basis, and were subsequently subjected to both correlation inference procedures. (A–C) True basis correlation network. (D–F) Networks inferred using the standard procedure. (G–I) Networks inferred using SparCC. The average community diversities used in the simulations and observed in the HMP data, as given by the Shannon entropy effective number of components, are indicated on the left. As in Fig. 1, nodes represent OTUs, with size reflecting the OTU's average fraction in the community. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn.
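The diversity measure named in this caption can be illustrated with a short function. This sketch assumes the common definition of the Shannon effective number of components as the exponential of the Shannon entropy (natural logarithm) of the relative fractions; the formula itself is not reproduced in this text, and the name `effective_number` is hypothetical.

```python
import math


def effective_number(fractions):
    """Exponential of the Shannon entropy of a composition.

    fractions: relative abundances that sum to 1. Zero entries contribute
    nothing to the entropy and are skipped to avoid log(0).
    """
    entropy = -sum(p * math.log(p) for p in fractions if p > 0)
    return math.exp(entropy)


even = effective_number([0.25, 0.25, 0.25, 0.25])    # four equally abundant OTUs
skewed = effective_number([0.97, 0.01, 0.01, 0.01])  # one dominant OTU
print(f"even community:   {even:.2f}")
print(f"skewed community: {skewed:.2f}")
```

Under this definition a perfectly even community of four OTUs has an effective number of four, while a community dominated by one OTU has an effective number close to one, even though both contain the same number of OTUs.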

Figure 3. SparCC outperforms standard inference.

Root-mean-square error (RMSE) of both Pearson (A) and SparCC (B) inferred correlations, as a function of the density of the underlying correlation network, as given by the probability that any pair of components is strongly correlated, and community diversity, as given by the Shannon entropy effective number of components. SparCC errors are smaller than Pearson errors for all parameter values. For the maximal diversity plotted, 50 effective OTUs, the inference error obtained using Pearson correlations is greatly decreased. Therefore, it is likely that Pearson correlations perform well on gene expression data, where the effective number of genes is typically in the hundreds or thousands. For each combination of density and diversity, multiple basis correlation networks were randomly generated, and corresponding data was sampled and used for correlation estimation. Dots labeled mid-vagina and gut indicate the average diversity observed in the mid-vagina and gut communities, and the density of their estimated correlation networks. Dots labeled 2D–I indicate the diversity and density used to generate the communities analyzed in Fig. 2.
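The error measure in this figure can be sketched as follows. The example assumes that the RMSE is taken over all distinct pairs of components, i.e. the upper triangle of the correlation matrix with the diagonal excluded; that averaging convention, the function name `correlation_rmse`, and the toy matrices are illustrative assumptions rather than details from the source.

```python
import math


def correlation_rmse(true_corr, inferred_corr):
    """RMSE between two square correlation matrices.

    Only pairs i < j are compared, since the diagonal is always 1 and the
    matrices are symmetric (an assumed convention).
    """
    n = len(true_corr)
    squared_errors = [
        (true_corr[i][j] - inferred_corr[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: a true positive correlation inferred with the opposite sign.
true_corr = [[1.0, 0.5], [0.5, 1.0]]
inferred_corr = [[1.0, -0.5], [-0.5, 1.0]]
rmse = correlation_rmse(true_corr, inferred_corr)
print(f"RMSE = {rmse:.2f}")
```

A sign flip of a correlation of magnitude 0.5 gives an error of 1.0 for that pair, which shows how the opposite-sign artifacts mentioned in the abstract would weigh heavily in such a measure.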

Figure 4. HMP correlation networks inferred using SparCC.

Networks inferred using SparCC from the same data as in Fig. 1 (see Fig. S2 for SparCC networks of all HMP body sites). No correlations with magnitude greater than the 0.3 cutoff were inferred from the shuffled data (not shown). Nodes represent OTUs, with size reflecting the OTU's average fraction in the community, and color corresponding to the phylum to which the OTU belongs. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn, and unconnected nodes are omitted. See Fig. S6 for all 18 HMP body sites.

Figure 5. Flow chart of iterative basis correlation inference procedure.

Grants and funding

This work, conducted by ENIGMA (Ecosystems and Networks Integrated with Genes and Molecular Assemblies), a Scientific Focus Area Program at Lawrence Berkeley National Laboratory, was supported by the Office of Science, Office of Biological and Environmental Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. JF was supported by the Merck-MIT Fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

