UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches

Baris E Suzek et al. Bioinformatics. 2015.

Abstract

Motivation: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters.

Results: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that the clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50, followed by expansion of the hit lists with cluster members, were compared with searches against UniProtKB sequences. The UniRef50-based searches are more concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detecting remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
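The search-then-expand step described in the Results can be illustrated with a minimal Python sketch. It assumes a hit list of UniRef50 cluster identifiers has already been obtained from a BLASTP run and that cluster memberships are available as a plain dictionary; the function name `expand_hits`, the identifiers and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List


def expand_hits(
    representative_hits: Iterable[str],
    cluster_members: Dict[str, List[str]],
) -> List[str]:
    """Expand a list of cluster-level hits into member-level hits.

    Each hit is replaced by all members of its cluster, preserving the
    original hit order and dropping duplicate accessions.
    """
    expanded: List[str] = []
    seen = set()
    for cluster_id in representative_hits:
        # Assumption: a hit with no known membership is kept as-is.
        for accession in cluster_members.get(cluster_id, [cluster_id]):
            if accession not in seen:
                seen.add(accession)
                expanded.append(accession)
    return expanded


# Hypothetical toy data: two clusters with known members, one without.
clusters = {
    "UniRef50_A": ["A", "A2", "A3"],
    "UniRef50_B": ["B"],
}
hits = ["UniRef50_A", "UniRef50_B", "UniRef50_C"]
expanded = expand_hits(hits, clusters)
```

In this sketch the short cluster-level hit list is what the search engine has to produce, and the member-level list is recovered afterwards by a dictionary lookup, which is the general idea behind searching a smaller clustered database and expanding the results.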

© The Author 2014. Published by Oxford University Press.

Figures

Fig. 1.

The categories of UniRef clusters based on intra-cluster functional consistency. The upper left panel shows an example of the GO term hierarchy used. The other panels illustrate the UniRef clusters in categories based on their intra-cluster consistency: I (all members have identical GO terms), II-1 (all members share common GO terms and some have additional less or equally specific GO terms, not children of the shared GO terms), II-2 (all members share common GO terms and some have additional more specific GO terms), III (only some members share common GO terms but all members’ GO terms can be traced to a common non-root parent GO term, is a child of one of the shared GO terms) and IV (members do not have any common GO term and the existing ones cannot be traced to a common non-root parent GO term)

Fig. 2.

Example UniProtKB/Swiss-Prot (query) and UniProtKB (target) pairs for distant similarity detection analysis, where Pfam domains common to query and targets span more than 80% of target protein sequences

Fig. 3.

Growth of UniRef databases and UniProt Knowledgebase

Fig. 4.

The size distribution of UniRef clusters follows a power law distribution

Fig. 5.

Distribution of UniRef90 cluster specificity for the complete set of clusters (top bars) and for those containing only model organisms (bottom bars). UniRef50 clusters follow a similar distribution

Fig. 6.

Precision and recall (Equations (2) and (3)) of UniRef50-based BLASTP searches expanded using cluster memberships at different _e_-value thresholds
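Equations (2) and (3) are not reproduced on this page. As a rough illustration only, the sketch below uses the standard set-based definitions of precision and recall, treating one hit list as the reference and the other as the retrieved set. Which list plays which role in the paper, and the function name `precision_recall`, are assumptions.

```python
from typing import Iterable, Tuple


def precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Standard set-based precision and recall.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    true_positives = len(retrieved_set & relevant_set)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical accessions: expanded cluster-based hits vs. reference hits.
precision, recall = precision_recall(["A", "B", "C", "D"], ["A", "B", "E"])
```

Evaluating this at several e-value cut-offs would only require filtering both hit lists by the chosen threshold before calling the function.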

Fig. 7.

The percentage difference in distant similarities detected by UniRef50- versus UniProtKB-based searches based on the dataset constructed using Pfam domains

Fig. 8.

ROC50 values for UniRef50- versus UniProtKB-based searches based on the dataset constructed using Pfam domains
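This page does not define how ROC50 was calculated. For orientation, a commonly used definition of ROC_n is the area under the ROC curve truncated at the n-th false positive and normalized to the range 0 to 1. The sketch below implements that common definition for a ranked list of true/false labels; the function name `roc_n` and the padding convention for lists with fewer than n false positives are assumptions, not details confirmed by the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], n: int = 50) -> float:
    """ROC_n score for hits ranked from best to worst.

    ranked_labels[i] is True when the i-th ranked hit is a true positive.
    The score sums, for each of the first n false positives, the number of
    true positives ranked above it, then divides by n * (total positives).
    """
    total_true = sum(1 for label in ranked_labels if label)
    if total_true == 0 or n <= 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for label in ranked_labels:
        if label:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break
    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranked below every hit in the list.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


# Hypothetical ranking: true, false, true, false; truncated at n = 2.
score = roc_n([True, False, True, False], n=2)
```

A score of 1.0 means every true positive is ranked above the first false positive; lower scores indicate true positives interleaved with or ranked below false positives.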
