A draft annotation and overview of the human genome
Genome Biol. 2001;2(7):RESEARCH0025.
doi: 10.1186/gb-2001-2-7-research0025. Epub 2001 Jul 4.
F A Wright, W J Lemon, W D Zhao, R Sears, D Zhuo, J P Wang, H Y Yang, T Baer, D Stredney, J Spitzner, A Stutz, R Krahe, B Yuan
- PMID: 11516338
- PMCID: PMC55322
- DOI: 10.1186/gb-2001-2-7-research0025
F A Wright et al. Genome Biol. 2001.
Abstract
Background: The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena.
Results: We report here a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. We describe numerous global features of the genome and examine the relationship of various genetic maps with the assembly. In addition, initial sequence analysis reveals highly ordered chromosomal landscapes associated with paralogous gene clusters and distinct functional compartments. Finally, these annotation data were synthesized to produce observations of gene density and number that accord well with historical estimates. Such a global approach had previously been described only for chromosomes 21 and 22, which together account for 2.2% of the genome.
Conclusions: We estimate that the genome contains 65,000-75,000 transcriptional units, with exon sequences comprising 4%. The creation of a comprehensive gene index requires the synthesis of all available computational and experimental evidence.
Figures
Figure 1
Overview map of features on the entire human genome, based on the working draft assembly (15 June 2000 release) and finished sequences for chromosomes 21 and 22. Ideograms are oriented with the p-arm at the top, and are assembly-corrected to form an approximate cytogenetic alignment with the features of the draft assembly depicted to the right of each ideogram. Sequencing gaps at the centromeres and contiguous heterochromatic regions are represented by horizontal lines. Chromosome 19 is an exception, for which evidence suggests that both heterochromatic regions are at least partially sequenced. Genomic features are presented as densities (that is, the proportion of base pairs occupied by each feature) in nonoverlapping 1 Mb intervals. The densities are corrected for sequencing gaps, indicated in the draft assembly as 50-200 kb segments of Ns (unsequenced nucleotides), but (with the exception of GC content) are not corrected for sporadic Ns of lower-quality base calls, because these would not interfere with assignment of the feature to the assembly. Exon density (red) is based on high-scoring pairs from Table 1, not necessarily in ORFs. CpG island density (blue) is based on the standard definition [45] of a run of at least 200 bases with GC content >50% and an observed-to-expected CpG ratio >0.6, implemented using the program cpg [90]. GC content (green) is the number of G or C bases divided by the number of non-N bases in the 1 Mb interval. LINE1 (blue) and Alu (black) repeat elements were determined using RepeatMasker [91], and minisatellites of repeat size 20-50 bp were determined using the etandem program of the EMBOSS suite [84]. Density ranges were selected to illuminate features across the genome while preserving a common scale to facilitate comparison. A number of values exceed the range for the feature and are truncated, with a small dot of the corresponding color placed under the ordinate. The data points for the figure are available in the additional data file.
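The quantities defined in this caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline and not the cpg program [90]. The function names are hypothetical, and the observed-to-expected ratio is assumed to be the commonly used form (number of CG dinucleotides × sequence length) / (number of C × number of G), which the caption does not spell out.

```python
def gc_content(seq: str) -> float:
    """G or C bases divided by the number of non-N bases."""
    seq = seq.upper()
    non_n = sum(1 for base in seq if base != "N")
    if non_n == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / non_n


def cpg_obs_exp(seq: str) -> float:
    """Observed-to-expected CpG ratio (assumed formula, see lead-in)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def is_cpg_island(seq: str) -> bool:
    """Apply the three thresholds quoted in the caption to one candidate run."""
    return len(seq) >= 200 and gc_content(seq) > 0.5 and cpg_obs_exp(seq) > 0.6


def feature_density(feature_bp: int, window_bp: int, gap_bp: int = 0) -> float:
    """Proportion of sequenced base pairs occupied by a feature in one window.

    gap_bp is the length of sequencing gaps (long runs of Ns) in the window;
    subtracting it is one simple way to model the gap correction.
    """
    sequenced = window_bp - gap_bp
    if sequenced <= 0:
        return 0.0
    return feature_bp / sequenced


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(gc_content("GGCCNNAT"))
    print(is_cpg_island("CG" * 100))
    print(feature_density(40_000, 1_000_000, gap_bp=200_000))
```

The sketch tests a single candidate run. A real CpG island scanner would also need a rule for sliding or extending windows along the chromosome, which the caption does not describe.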
Figure 2
Coding sequence density for human chromosomes. (a) The proportion of assembled sequence that is exonic provides direct confirmation of previously hypothesized patterns of gene density. (b) Transcriptional units per megabase. Additional plots and data are in the additional data files.
Figure 3
Total number of embryo-specific genes (based on HINT clusters) for each chromosome. Chromosomes 13, 18, 21 and Y clearly have lower numbers than other chromosomes.
Figure 4
The correspondence between physical location and maps constructed using different mapping methods. (a) Correspondence between the genetic map and physical location. (b) Correspondence between radiation hybrid maps and physical location. The GB4 (black) radiation hybrid map shows a jump at the centromere, reflecting a sequencing gap and possible increased radiation sensitivity in the region. The jump for the Stanford G3 map (blue) is not easily estimated and is suppressed in the published map. Chromosome 1 is shown here for illustration, and the corresponding figures and data points for the entire genome are available in the additional data files.
Figure 5
Repeat-masked chromosome sequences were divided into 1 Mb segments and analyzed against the entire chromosomal sequence. Matches of at least 70% identity (both forward and reverse) and E < 10^-25 are plotted. The diagonal line of complete identity has been removed to clarify features near the diagonal. Plots for each chromosome are available in the additional data files.
Similar articles
- Assembly, annotation, and integration of UNIGENE clusters into the human genome draft.
Zhuo D, Zhao WD, Wright FA, Yang HY, Wang JP, Sears R, Baer T, Kwon DH, Gordon D, Gibbs S, Dai D, Yang Q, Spitzner J, Krahe R, Stredney D, Stutz A, Yuan B. Genome Res. 2001 May;11(5):904-18. doi: 10.1101/gr.gr-1645r. PMID: 11337484 Free PMC article.
- An Experimental Approach to Genome Annotation: This report is based on a colloquium sponsored by the American Academy of Microbiology held July 19-20, 2004, in Washington, DC.
[No authors listed] Washington (DC): American Society for Microbiology; 2004. PMID: 33001599 Free Books & Documents. Review.
- Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map.
Flicek P, Keibler E, Hu P, Korf I, Brent MR. Genome Res. 2003 Jan;13(1):46-54. doi: 10.1101/gr.830003. PMID: 12529305 Free PMC article.
- Ontology annotation: mapping genomic regions to biological function.
Thomas PD, Mi H, Lewis S. Curr Opin Chem Biol. 2007 Feb;11(1):4-11. doi: 10.1016/j.cbpa.2006.11.039. Epub 2007 Jan 5. PMID: 17208035 Review.
Cited by
- Alternative Transcripts Diversify Genome Function for Phenome Relevance to Health and Diseases.
Carrion SA, Michal JJ, Jiang Z. Genes (Basel). 2023 Nov 8;14(11):2051. doi: 10.3390/genes14112051. PMID: 38002994 Free PMC article. Review.
- Small RNAs, Big Diseases.
Rzeszutek I, Singh A. Int J Mol Sci. 2020 Aug 9;21(16):5699. doi: 10.3390/ijms21165699. PMID: 32784829 Free PMC article. Review.
- A high-quality genome assembly from a single, field-collected spotted lanternfly (Lycorma delicatula) using the PacBio Sequel II system.
Kingan SB, Urban J, Lambert CC, Baybayan P, Childers AK, Coates B, Scheffler B, Hackett K, Korlach J, Geib SM. Gigascience. 2019 Oct 1;8(10):giz122. doi: 10.1093/gigascience/giz122. PMID: 31609423 Free PMC article.
- Tales from topographic oceans: topologically associated domains and cancer.
Campbell MJ. Endocr Relat Cancer. 2019 Nov;26(11):R611-R626. doi: 10.1530/ERC-19-0348. PMID: 31505466 Free PMC article. Review.
- Loose ends: almost one in five human genes still have unresolved coding status.
Abascal F, Juan D, Jungreis I, Kellis M, Martinez L, Rigau M, Rodriguez JM, Vazquez J, Tress ML. Nucleic Acids Res. 2018 Aug 21;46(14):7070-7084. doi: 10.1093/nar/gky587. PMID: 29982784 Free PMC article.
References
- International Human Genome Consortium http://www.nhgri.nih.gov/genome_sequence.html
- Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M. Shotgun sequencing of the human genome. Science. 1998;280:1540–1542.
- TIGR Microbial Database http://www.tigr.org/tdb/mdb/mdbcomplete.html
- Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195.
- The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.