Multivariate Cutoff Level Analysis (MultiCoLA) of large community data sets

Angélique Gobet et al. Nucleic Acids Res. 2010 Aug.

Abstract

High-throughput sequencing techniques are becoming attractive to molecular biologists and ecologists as they provide a time- and cost-effective way to explore diversity patterns in environmental samples at an unprecedented resolution. An issue common to many studies is the definition of what fractions of a data set should be considered as rare or dominant. Yet this question has not been satisfactorily addressed, nor has the impact of such a definition on data set structure and interpretation been fully evaluated. Here we propose a strategy, MultiCoLA (Multivariate Cutoff Level Analysis), to systematically assess the impact of various abundance or rarity cutoff levels on the resulting data set structure and on the consistency of the further ecological interpretation. We applied MultiCoLA to a 454 massively parallel tag sequencing data set of V6 ribosomal sequences from marine microbes in temperate coastal sands. Consistent ecological patterns were maintained after removing up to 35-40% rare sequences and similar patterns of beta diversity were observed after denoising the data set by using a preclustering algorithm of 454 flowgrams. This example validates the importance of exploring the impact of the definition of rarity in large community data sets. Future applications can be foreseen for data sets from different types of habitats, e.g. other marine environments, soil and human microbiota.

Figures

Figure 1.

Two ways of assigning rarity cutoffs to the original data set. (A) In the data set-based approach, cutoff levels are assigned to the original data set according to several percentages (0, 1, 5–95 and 99%) of the total number of sequences in the data set. The data set was sorted according to the decreasing total sum of OTU sequences (columns, here) before selecting out rare OTUs. For instance, a cutoff assignment of 1% removes 1% of the low-abundant OTUs. (B) In the sample-based approach, cutoff levels are assigned to the original data set according to the occurrence (1–208 sequences) of each OTU in each sample. The maximum cutoff (here, 208) was chosen according to the lowest number of the maximum OTU occurrences in all samples; this is the limit when some samples did not contain any more OTUs. For example, the assignment of a cutoff level of 3 removes OTUs occurring less than three times in each sample.
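The two cutoff approaches in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout (OTU name mapped to per-sample counts) and the toy table are hypothetical. It assumes that the data set-based cutoff drops the rarest OTUs for as long as the removed sequences stay within the chosen fraction of all sequences, and that the sample-based cutoff zeroes any count below the threshold within a sample and then discards OTUs left empty.

```python
def truncate_dataset_based(table, fraction):
    """Drop the rarest OTUs while the removed sequences stay within
    `fraction` of the total sequence count (assumed interpretation)."""
    totals = {otu: sum(counts) for otu, counts in table.items()}
    budget = fraction * sum(totals.values())
    removed = 0
    kept = dict(table)
    # Walk from the rarest OTU to the most abundant one.
    for otu in sorted(totals, key=lambda o: (totals[o], o)):
        if removed + totals[otu] > budget:
            break
        removed += totals[otu]
        del kept[otu]
    return kept


def truncate_sample_based(table, min_count):
    """Zero every count below `min_count` within a sample, then drop
    OTUs that are empty in all samples (assumed interpretation)."""
    kept = {}
    for otu, counts in table.items():
        filtered = [c if c >= min_count else 0 for c in counts]
        if any(filtered):
            kept[otu] = filtered
    return kept


# Hypothetical toy table: four OTUs counted in three samples.
toy = {
    "otu_a": [50, 40, 30],
    "otu_b": [10, 0, 5],
    "otu_c": [1, 2, 0],
    "otu_d": [0, 1, 1],
}

print(sorted(truncate_dataset_based(toy, 0.05)))
print(truncate_sample_based(toy, 3))
```

Running both functions over a range of cutoffs yields the series of truncated tables that the later comparison steps take as input.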

Figure 2.

MultiCoLA steps. After truncating the original table according to various abundance cutoff levels, the effects of specific rarity definitions are tested by applying three types of analyses: (1) Variations in data set structure are established based on non-parametric correlations of pairwise distance matrices (e.g. calculated with the Bray–Curtis coefficient). (2) The amounts of extracted community variation (using NMDS) from the original data and the truncated data sets are compared by Procrustes correlations. (3) When additional parameters are available, the biological variation that can be explained by environmental parameters in the original and in the truncated data sets is then systematically compared. D, dominant OTUs; R, rare OTUs.
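Step (1) can be illustrated with a small, standard-library-only Python sketch. It is an assumption-laden example rather than the published code: the function names and the toy matrices are hypothetical, samples are rows and OTUs are columns, and the truncated table is taken to be the original with its two rarest OTU columns removed. It computes Bray–Curtis dissimilarities between all sample pairs in each table and then the Spearman rank correlation between the two sets of distances. Steps (2) and (3) are not covered here.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def pairwise_distances(samples):
    """Distances for every unordered pair of samples (rows)."""
    return [bray_curtis(samples[i], samples[j])
            for i, j in combinations(range(len(samples)), 2)]


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: 4 samples (rows) x 4 OTUs (columns).
original = [[50, 10, 1, 0], [40, 0, 2, 1], [30, 5, 0, 1], [5, 20, 0, 0]]
truncated = [row[:2] for row in original]  # two rarest OTUs dropped

rho = spearman(pairwise_distances(original), pairwise_distances(truncated))
print(rho)
```

A correlation close to 1 would indicate that removing the rare OTUs barely changes how the samples relate to one another; repeating the calculation across cutoff levels gives a profile of the kind the caption describes.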

Figure 3.

MultiCoLA profiles for data set structure, most important axes of extracted variation and interpretation of biological variation based on the data set-based (A–D) and sample-based (E–H) approaches. (A, E) Abundance of dominant OTUs in each truncated data set at the phylum, class, order, family, genus and OTU levels. A black solid line indicates comparisons at the OTU level for the data set with a complete annotation and a black dashed line indicates the OTU level with the whole data set (OTU whole DS). (B, F) Non-parametric Spearman correlations comparing the deviation in complete data structure between the original matrix and truncated matrices. (C, G) Comparison of most important axes of extracted variation between the original and truncated data sets. (D, H) Partitioning of the biological variation at the OTU level (all OTUs) into the respective effects of environmental factors (nutrients and cell abundance). Negative values, unexplained variation and non-significant models are not shown. SiO2, silicate; PO4, phosphate; NH4, ammonium; covariation of any of the four environmental factors is represented under the same category. Asterisk indicates a significant effect of the pure factors (P < 5%), whereas ‘NS’ indicates non-significant models. A cross indicates non-significant Bonferroni corrected models. Lacking points or bars are due to sample loss by applying a given cutoff to the original data set. In (E–H), the upper _x_-axis corresponds to cutoff levels defined as a function of the sample-based approach, and the lower _x_-axis represents the corresponding proportion of removed sequences in the OTU data set (all OTUs). This enables the comparison of the data set-based approach with the sample-based approach. Note that (D and H) have a different legend than (A–C) and (E–G).

Figure 4.

MultiCoLA profiles for data set structure and most important axes of extracted variation based on the data set (A–C) and sample (D–F) cutoff approaches for PyroNoise-corrected 454 MPTS data and the original 454 MPTS data set at the OTU level. Different colored lines indicate PyroNoise-corrected data sets whose sequences were further clustered at various sequence dissimilarity values. See Figure 3 for further details.

Figure 5.

MultiCoLA profiles using the matrix with the most abundant OTUs as a reference for the comparison with the truncated matrices. (A–C) are based on the data set-based approach and (D–F) on the sample-based approach. See Figure 3 for further descriptions of each panel.

