A hybrid clustering approach to recognition of protein families in 114 microbial genomes

Timothy J Harlow et al. BMC Bioinformatics. 2004.

Abstract

Background: Grouping proteins into sequence-based clusters is a fundamental step in many bioinformatic analyses (e.g., homology-based prediction of structure or function). Standard clustering methods such as single-linkage clustering capture a history of cluster topologies as a function of threshold, but in practice their usefulness is limited because unrelated sequences join clusters before biologically meaningful families are fully constituted, e.g. as the result of matches to so-called promiscuous domains. Use of the Markov Cluster algorithm avoids this non-specificity, but does not preserve topological or threshold information about protein families.

Results: We describe a hybrid approach to sequence-based clustering of proteins that combines the advantages of standard and Markov clustering. We have implemented this hybrid approach over a relational database environment, and describe its application to clustering a large subset of PDB, and to 328577 proteins from 114 fully sequenced microbial genomes. To demonstrate utility with difficult problems, we show that hybrid clustering allows us to constitute the paralogous family of ATP synthase F1 rotary motor subunits into a single, biologically interpretable hierarchical grouping that was not accessible using either single-linkage or Markov clustering alone. We describe validation of this method by hybrid clustering of PDB and mapping SCOP families and domains onto the resulting clusters.

Conclusion: Hybrid (Markov followed by single-linkage) clustering combines the advantages of the Markov Cluster algorithm (avoidance of non-specific clusters resulting from matches to promiscuous domains) and single-linkage clustering (preservation of topological information as a function of threshold). Within the individual Markov clusters, single-linkage clustering is a more precise instrument, discerning sub-clusters of biological relevance. Our hybrid approach thus provides a computationally efficient means of automated recognition of protein families for phylogenomic analysis.
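As a rough illustration of the two-stage idea, the sketch below applies threshold-based single-linkage clustering first to a whole similarity graph and then separately inside each Markov cluster. It is not the authors' relational-database implementation. The protein names, the scores, and the Markov partition are invented for the example, the Markov step is taken as given rather than computed, and the function names `single_linkage` and `hybrid` are hypothetical. Scores are assumed to behave like a normalised similarity in which higher values mean closer matches, so a lower threshold admits more links.

```python
from collections import defaultdict


def single_linkage(nodes, edges, threshold):
    """Group nodes into connected components using edges scoring >= threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


def hybrid(markov_clusters, edges, threshold):
    """Run single-linkage separately inside each pre-computed Markov cluster."""
    result = []
    for members in markov_clusters:
        members = set(members)
        # Keep only edges with both endpoints inside this Markov cluster.
        inner = [(a, b, s) for a, b, s in edges if a in members and b in members]
        result.extend(single_linkage(sorted(members), inner, threshold))
    return sorted(result)


# Invented toy data: two families joined by one weak match, standing in
# for a hit to a shared promiscuous domain (A3 to B1).
NODES = ["A1", "A2", "A3", "B1", "B2"]
EDGES = [
    ("A1", "A2", 0.90),
    ("A2", "A3", 0.40),
    ("B1", "B2", 0.80),
    ("A3", "B1", 0.05),
]
# Assumed output of a prior Markov clustering step (not computed here).
MARKOV = [["A1", "A2", "A3"], ["B1", "B2"]]

for t in (0.01, 0.50):
    print(t, "single-linkage:", single_linkage(NODES, EDGES, t))
    print(t, "hybrid:        ", hybrid(MARKOV, EDGES, t))
```

In this toy case, plain single-linkage at the most permissive threshold merges both families through the weak A3 to B1 link, whereas the hybrid version keeps them apart and still resolves sub-clusters within each family as the threshold is raised.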

Figures

Figure 1

Single-linkage clustering of multi-genome data. (a) Number of clusters of n ≥ 4 members each produced by single-linkage clustering of proteins in 114 microbial genomes (without prior Markov clustering), as a function of S'norm threshold; (b) number of proteins in single-linkage clusters (n ≥ 4), as a function of threshold; (c) number of proteins in the largest single-linkage cluster, as a function of threshold.

Figure 2

Hybrid clustering of multi-genome data. (a) Number of clusters of n ≥ 4 members each produced by hybrid (Markov followed by single-linkage) clustering of proteins in 114 microbial genomes, as a function of S'norm threshold. Compare the value at the right-most point on the distribution (S'norm 0.01) with that in Figure 1 to see the effect of the prior Markov clustering step; (b) number of proteins in hybrid clusters (n ≥ 4), as a function of threshold; (c) number of proteins in the largest hybrid cluster, as a function of threshold. Note that the vertical axis is scaled differently from that in Figure 1c.

Figure 3

Markov clusters as a function of inflation value. Markov clustering of proteins in 114 microbial genomes at six Markov inflation values, showing numbers of clusters as a function of number of proteins per cluster.

Figure 4

Genome representation in MRCs. Numbers of maximally representative clusters (MRCs) of size 4 (the minimum cluster size considered in this work) to 114 (the number of genomes analysed).

Figure 5

Bacterial phylum representation in MRCs. Numbers of maximally representative clusters (n ≥ 4) as a function of number of bacterial "phyla" (second-order NCBI classifications, e.g. Aquificales, Bacteroidetes, etc.) represented in each.

Figure 6

Threshold and range distributions of MRCs. (a) Numbers of maximally representative clusters (n ≥ 4), as a function of maximum threshold expressed as S'norm; (b) numbers of maximally representative clusters (n ≥ 4), as a function of minimum threshold expressed as S'norm. Note the 1531 MRCs at S'norm = 0.01; (c) numbers of maximally representative clusters (n ≥ 4), as a function of range of maximality (extent along S'norm). The range of maximality of a maximal cluster is the length of the internal edge immediately subtending it.

Figure 7

Clustering of ATP synthase F1 paralog sequences. Membership in the ATP synthase F1 cluster, as a function of S'norm threshold. Single-linkage and hybrid clustering gave identical results at S'norm ≥ 0.22; cluster structure below S'norm 0.22 is for our hybrid method only (see text). NCBI gi numbers are displayed across the top for all F1β subunit sequences, and for three singleton sequences that group with this paralogous family. Large adjacent dots depict clusters at S'norm 1.00, and small adjacent dots show singleton sequences at S'norm 1.00 that are clustered at 0.99.
