A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database

Paramvir S Dehal et al. BMC Bioinformatics. 2006.

Abstract

Background: We present the PhIGs database, a phylogenomic resource for sequenced genomes. Although many methods exist for clustering gene families, very few attempt to create truly orthologous clusters, that is, clusters whose members share descent from a single ancestral gene, across a range of evolutionary depths. Non-phylogenetic gene family clusters have been used broadly for gene annotation, but they are known to introduce errors through the artifactual association of slowly evolving paralogs and through the lack of annotation for more rapidly evolving ones. A full phylogenetic framework is necessary for accurate inference of function and for many studies that address the pattern and mechanism of genome evolution. The automated generation of evolutionary gene clusters, creation of gene trees, and determination of orthology and paralogy relationships, together with the correlation of this information with gene annotations, expression information, and genomic context, provides an important resource for the scientific community.

Discussion: The PhIGs database currently contains 23 completely sequenced genomes of fungi and metazoans, comprising 409,653 genes that have been grouped into 42,645 gene clusters. Each gene cluster is built so that the gene sequence distances are consistent with the known organismal relationships, which maximizes the likelihood that the clusters represent truly orthologous genes. The PhIGs website contains tools that allow the study of genes within their phylogenetic framework through keyword searches on annotations, such as GO and InterPro assignments, and through sequence similarity searches by BLAST and HMM. In addition to displaying the evolutionary relationships of the genes in each cluster, the website allows users to view the relative physical positions of homologous genes in specified sets of genomes.

Summary: Accurate analyses of genes and genomes can only be done within their full phylogenetic context. The PhIGs database and corresponding website (http://phigs.org) address this problem for the scientific community. Our goal is to expand the content as more genomes are sequenced and to use this framework to incorporate more analyses.

Figures

Figure 1

Flowchart of the PhIGs process. This is a graphical overview of the pipeline for processing gene models from many genomes into the PhIGs analysis.

Figure 2

Illustration of the clustering method. The tree shown on the left side of the figure indicates the evolutionary relationships among several hypothetical organisms, four from Clade A, two from Clade B, and one that is an outgroup. The right side of the figure illustrates a protein distance graph with circles representing proteins colored to conform to each organism, with the spatial distance of the circles proportional to their sequence distance. The cluster is created by identifying a pair of sequences (a seed) that is the shortest distance from any Clade A protein to any Clade B protein. The cluster is then grown by adding all proteins that have a shorter distance than the seed until no additions can be made. The blue cloud represents one such cluster. See text for more details.
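The seed-and-grow idea in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PhIGs pipeline itself: the function name `seed_and_grow`, the distance dictionary, and the protein labels are all hypothetical, and "a shorter distance than the seed" is read here as "closer to at least one current cluster member than the seed distance", which the caption does not spell out.

```python
from itertools import product


def seed_and_grow(dist, clade_a, clade_b):
    """Sketch of seed-and-grow clustering (hypothetical helper, not PhIGs code).

    dist    : dict mapping frozenset({p, q}) -> sequence distance
    clade_a : iterable of protein ids from Clade A
    clade_b : iterable of protein ids from Clade B
    """
    def d(p, q):
        # Missing pairs are treated as infinitely distant (an assumption).
        return dist.get(frozenset((p, q)), float("inf"))

    # Seed: the closest pair made of one Clade A and one Clade B protein.
    seed = min(product(clade_a, clade_b), key=lambda pair: d(*pair))
    seed_dist = d(*seed)

    cluster = set(seed)
    proteins = {p for pair in dist for p in pair}

    # Grow: keep adding proteins that are closer to some cluster member
    # than the seed distance, until a full pass adds nothing.
    grew = True
    while grew:
        grew = False
        for p in proteins - cluster:
            if any(d(p, member) < seed_dist for member in cluster):
                cluster.add(p)
                grew = True
    return seed, seed_dist, cluster


# Toy data: two Clade A proteins, one Clade B protein, one outgroup protein.
toy_dist = {
    frozenset(("a1", "b1")): 0.5,
    frozenset(("a2", "b1")): 0.6,
    frozenset(("a1", "a2")): 0.2,
    frozenset(("out1", "a1")): 0.9,
    frozenset(("out1", "a2")): 0.95,
    frozenset(("out1", "b1")): 0.8,
}
seed, seed_dist, cluster = seed_and_grow(toy_dist, ["a1", "a2"], ["b1"])
print(seed, seed_dist, sorted(cluster))
```

In this toy input the outgroup protein stays outside the cluster because all of its distances exceed the seed distance, which mirrors the way the blue cloud in the figure excludes the more distant sequences.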

Figure 3

An example phylogenetic tree. This is one output of the PhIGs analysis that is shown on the Cluster View webpage. Instead of simply listing the members of a cluster, a phylogenetic tree is created showing the evolutionary relationships of this multigene family. In this example, we can see that this family had gene duplication events at the base of vertebrates and in the fish lineage. Because the branch lengths are proportional to the rate of amino acid substitutions, we can see how rates of evolution have varied.

Figure 4

An example Synteny Map. Genes ranging from number 205 through 301 on chicken chromosome 2 (numbered as they occur from the p-telomere to q-telomere along the chromosome) are shown as rectangles in the center of the diagram. On the left and right are the orthologs of these genes found in the human and mouse genomes as determined by the PhIGs analysis, shown as they are arranged. Black connecting lines join orthologs in the same relative transcriptional orientation whereas red lines indicate those that are inverted. Blue rectangles indicate intervening genes without identified orthologs in the genomes being compared. Cyan rectangles that do not have connecting lines, as can be seen for a portion of mouse chromosome 2, indicate that orthologs exist in chicken (the query genome), but not in the portion specified for this page.
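The color scheme described in this caption can be summarized as a small decision procedure. The sketch below is an assumption-laden illustration rather than the website's actual logic: the function `classify_target_genes`, its data layout, and the gene identifiers are hypothetical, and a simple strand comparison stands in for "same relative transcriptional orientation".

```python
def classify_target_genes(target_genes, orthologs, shown_query_genes):
    """Assign a display category to each gene in a comparison genome.

    target_genes      : list of (gene_id, strand) in the comparison genome
    orthologs         : dict target gene_id -> (query gene_id, query strand)
    shown_query_genes : set of query gene ids inside the displayed window
    All names and structures here are hypothetical.
    """
    categories = {}
    for gene_id, strand in target_genes:
        if gene_id not in orthologs:
            # Intervening gene with no identified ortholog.
            categories[gene_id] = "blue rectangle"
            continue
        query_id, query_strand = orthologs[gene_id]
        if query_id not in shown_query_genes:
            # Ortholog exists in the query genome, but outside this view.
            categories[gene_id] = "cyan rectangle, no line"
        elif strand == query_strand:
            categories[gene_id] = "black line"
        else:
            categories[gene_id] = "red line"
    return categories


# Toy example with made-up gene ids.
target = [("m1", "+"), ("m2", "-"), ("m3", "+"), ("m4", "+")]
ortho = {
    "m1": ("c205", "+"),  # same orientation
    "m2": ("c206", "+"),  # inverted
    "m4": ("c900", "+"),  # ortholog lies outside the displayed range
}
shown = {"c205", "c206"}
categories = classify_target_genes(target, ortho, shown)
print(categories)
```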
