Most of the Human Genome Is Transcribed

Gane Ka-Shu Wong(1,3), Douglas A. Passey(1), and Jun Yu(1,2)

(1) University of Washington Genome Center, Department of Medicine, Fluke Hall, M/C 352145, Seattle, Washington 98195, USA; (2) Genomics and Bioinformatics Center, Chinese Academy of Science, Beijing, People's Republic of China

Initial sequence annotations of the human genome have uncovered at least 32,000 genes (International Human Genome Sequencing Consortium 2001), or 26,000–39,000 genes (Venter et al. 2001). The mean gene size is thought to be 27 kb. Although the authors themselves acknowledge that these gene count estimates are very conservative, they are not significantly smaller than a recent estimate of 35,000 genes (Ewing and Green 2000) that was derived from a “proven” sampling technique. They are, however, significantly smaller than the previously accepted estimates of 70,000 genes (Antequera and Bird 1993; Fields et al. 1994). Suppose we accept the new gene counts, compromising between the two papers and settling at 35,000 genes. Let us further assume a euchromatic genome size of 2.9 Gb. At 27 kb per gene, 35,000 genes span about 945 Mb, which would imply that only 33% of the genome is transcribed and that the remaining 67% is intergenic DNA.
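The arithmetic behind the 33% figure can be checked with a short sketch. The function and constant names below are illustrative and do not come from the original analysis, and the calculation assumes that genes do not overlap, so that the genic span is simply the gene count multiplied by the mean gene size.

```python
def transcribed_fraction(gene_count, mean_gene_size_bp, genome_size_bp):
    """Fraction of the genome covered by genes.

    Assumes non-overlapping genes (a simplification made for this sketch).
    """
    return gene_count * mean_gene_size_bp / genome_size_bp


# Values taken from the text above.
GENE_COUNT = 35_000                     # compromise gene count
MEAN_GENE_SIZE_BP = 27_000              # 27 kb mean gene size
EUCHROMATIC_GENOME_BP = 2_900_000_000   # 2.9 Gb euchromatic genome

genic_bp = GENE_COUNT * MEAN_GENE_SIZE_BP
fraction = transcribed_fraction(GENE_COUNT, MEAN_GENE_SIZE_BP, EUCHROMATIC_GENOME_BP)

print(f"genic DNA:   {genic_bp / 1e6:.0f} Mb")    # 945 Mb
print(f"transcribed: {fraction:.1%}")             # 32.6%
print(f"intergenic:  {1 - fraction:.1%}")         # 67.4%
```

Because the fraction is linear in the mean gene size, any revision of the 27 kb figure changes the transcribed percentage in direct proportion.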

The amount of intergenic DNA so computed contradicts an assertion (Wong et al. 2000) that most of the human genome is transcribed. We believe that this assertion is still correct. Interestingly, the discrepancy is not due to these newer but smaller gene counts. The problem arises from the mean gene sizes, which have been significantly underestimated because of sampling biases resulting from the lack of large genomic contigs. We note that 25%, 50%, and 75% of the public consortium's genome sequence is in contigs of sizes <21.0, <84.5, and <290.5 kb, respectively. In contrast, human genes can be much larger than these contigs. For example, the dystrophin gene on chromosome X is 2.3 Mb. The neurexin-3 gene on chromosome 14 is 1.46 Mb, and one of its introns is 479 kb. It is impossible to determine the correct size of a large gene when its exons are scattered among smaller contigs. Insofar as estimates of …