A draft annotation and overview of the human genome

Genome Biol. 2001;2(7):RESEARCH0025.

doi: 10.1186/gb-2001-2-7-research0025. Epub 2001 Jul 4.

F A Wright, W J Lemon, W D Zhao, R Sears, D Zhuo, J P Wang, H Y Yang, T Baer, D Stredney, J Spitzner, A Stutz, R Krahe, B Yuan

F A Wright et al. Genome Biol. 2001.

Abstract

Background: The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena.

Results: We report here a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. We describe numerous global features of the genome and examine the relationship of various genetic maps with the assembly. In addition, initial sequence analysis reveals highly ordered chromosomal landscapes associated with paralogous gene clusters and distinct functional compartments. Finally, these annotation data were synthesized to produce observations of gene density and number that accord well with historical estimates. Such a global approach had previously been described only for chromosomes 21 and 22, which together account for 2.2% of the genome.

Conclusions: We estimate that the genome contains 65,000-75,000 transcriptional units, with exon sequences comprising 4%. The creation of a comprehensive gene index requires the synthesis of all available computational and experimental evidence.

Figures

Figure 1

Overview map of features on the entire human genome, based on the working draft assembly (15 June 2000 release) and finished sequences for chromosomes 21 and 22. Ideograms are oriented with the p-arm at the top, and are assembly-corrected to form an approximate cytogenetic alignment with the features of the draft assembly depicted to the right of each ideogram. Sequencing gaps at the centromeres and contiguous heterochromatic regions are represented by horizontal lines. Chromosome 19 is an exception, for which evidence suggests that both heterochromatic regions are at least partially sequenced. Genomic features are presented as densities (that is, proportion of base pairs occupied by each feature) in nonoverlapping 1 Mb intervals. The densities are corrected for sequencing gaps, indicated in the draft assembly as 50-200 kb segments of Ns (unsequenced nucleotides), but (with the exception of GC content) are not corrected for sporadic Ns of lower-quality base calls, because these would not interfere with assignment of the feature to the assembly. Exon density (red) is based on high-scoring pairs from Table 1, not necessarily in ORFs. CpG island density (blue) is based on standard definitions [45] of a run of at least 200 bases with GC content >50% and observed over expected CpG >0.6, and implemented using the program cpg [90]. GC content (green) is the number of G or C bases divided by the number of non-N bases in the 1 Mb interval. LINE1 (blue) and Alu (black) repeat elements were determined using RepeatMasker [91], and minisatellites of repeat size 20-50 bp using the etandem program of the EMBOSS suite [84]. Density ranges were selected to illuminate features across the genome while preserving a common scale to facilitate comparison. A number of values exceed the range for the feature and are truncated, with a small dot of the corresponding color placed under the ordinate. The data points for the figure are available in the additional data file.
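
The GC content and CpG island criteria in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' pipeline (they used the program cpg [90]); the function names are hypothetical, and the observed/expected CpG ratio is assumed here to be the commonly used form (number of CpG × length) / (number of C × number of G), which the caption does not spell out.

```python
def gc_content(seq: str) -> float:
    """G or C bases divided by non-N bases, as described in the caption."""
    seq = seq.upper()
    non_n = sum(1 for base in seq if base != "N")
    if non_n == 0:
        return float("nan")  # window is entirely unsequenced
    return (seq.count("G") + seq.count("C")) / non_n


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio (assumed formula, see lead-in)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def is_cpg_island_candidate(seq: str) -> bool:
    """Apply the three thresholds quoted in the caption to one run of bases."""
    return len(seq) >= 200 and gc_content(seq) > 0.5 and cpg_obs_exp(seq) > 0.6


def window_gc(seq: str, window: int = 1_000_000) -> list[float]:
    """GC content in nonoverlapping windows (1 Mb by default)."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]
```

For example, `window_gc(chromosome_sequence)` would give one GC value per 1 Mb interval, where `chromosome_sequence` is a placeholder for a sequence string loaded elsewhere.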

Figure 2

Coding sequence density for human chromosomes. (a) The proportion of assembled sequence that is exonic provides direct confirmation of previously hypothesized patterns of gene density. (b) Transcriptional units per megabase. Additional plots and data are in the additional data files.

Figure 3

Total number of embryo-specific genes (based on HINT clusters) for each chromosome. Chromosomes 13, 18, 21 and Y clearly have lower numbers than other chromosomes.

Figure 4

The correspondence between physical location and maps constructed using different mapping methods. (a) Correspondence between the genetic map and physical location. (b) Correspondence between radiation hybrid maps and physical location. The GB4 (black) radiation hybrid map shows a jump at the centromere, reflecting a sequencing gap and possible increased radiation sensitivity in the region. The jump for the Stanford G3 map (blue) is not easily estimated and is suppressed in the published map. Chromosome 1 is shown here for illustration, and the corresponding figures and data points for the entire genome are available in the additional data files.

Figure 5

Repeat-masked chromosome sequences were divided into 1 Mb segments and analyzed against the entire chromosomal sequence. Matches of at least 70% identity (both forward and reverse) and E < 10^-25 are plotted. The diagonal line of complete identity has been removed to clarify features near the diagonal. Plots for each chromosome are available in the additional data files.
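
The filtering described in the Figure 5 caption might be sketched as follows. The `Hit` record and `keep_for_dotplot` function are hypothetical names introduced for illustration, and treating "the diagonal" as matches whose query and subject start positions coincide is an assumption; the caption does not say how the authors removed it.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    q_start: int      # start of the match on the query segment (chromosome coordinates)
    s_start: int      # start of the match on the whole-chromosome subject
    identity: float   # percent identity of the match
    evalue: float     # expectation value reported by the search tool


def keep_for_dotplot(hits, min_identity=70.0, max_evalue=1e-25):
    """Keep matches meeting the caption's thresholds, dropping self-matches."""
    return [
        h for h in hits
        if h.identity >= min_identity
        and h.evalue < max_evalue
        and h.q_start != h.s_start  # assumed definition of the diagonal
    ]
```

Each retained hit would then be plotted at (`q_start`, `s_start`), so that paralogous clusters appear as off-diagonal features.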

References

    1. International Human Genome Consortium. http://www.nhgri.nih.gov/genome_sequence.html
    2. Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M. Shotgun sequencing of the human genome. Science. 1998;280:1540–1542.
    3. TIGR Microbial Database. http://www.tigr.org/tdb/mdb/mdbcomplete.html
    4. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195.
    5. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.