Small Open Reading Frames: Beautiful Needles in the Haystack

Munira A. Basrai, Philip Hieter, and Jef D. Boeke

Department of Molecular Biology and Genetics, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205

… and a time for all things; a time for great things, and a time for small things.

Miguel de Cervantes (1547–1616)

The completion of genome sequences from model organisms creates new opportunities and resources for both basic and applied research. The genome sequences of several bacteria, as well as that of _Saccharomyces cerevisiae_, represent landmark achievements (Goffeau et al. 1996, 1997). The era of complete genome sequences offers many opportunities to explore the wealth of information contained within a genome, but it is also one of the most challenging phases for researchers and emphasizes the need for global approaches to biological problems. One of these challenges is identifying and defining very small protein-coding genes, which can easily escape detection because they are “buried” in an enormous pile of meaningless short ORFs. Yet the subset of small, functional ORFs (here abbreviated smORFs) probably encodes very interesting proteins in all organisms, including humans.

The Difficulties of Defining Meaningful smORFs

All long DNA sequences, including random ones, contain many open reading frames (ORFs) of 1–99 codons in length; biological sequences also contain many ORFs >99 codons long that correspond to real protein-coding genes. The “gray area” surrounding the ad hoc 100-codon boundary presents two special problems for biologists: (1) ORFs of 100–150 codons include numerous artifactual ORFs (Fickett 1995; Das et al. 1997); and (2) the set of ORFs of 1–99 codons, among which the probability of being biologically meaningless is exceedingly high, nevertheless contains numerous interesting genes, which are easily missed because of the sheer number of small ORFs. To illustrate the magnitude of this problem, we plotted the total number of ORFs in the yeast genome of all lengths between 2 and 1000 codons (Fig. 1).
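The point that even random DNA is crowded with short ORFs can be illustrated with a minimal sketch. It is not the procedure used to produce Fig. 1. It assumes that an ORF runs from the first ATG after a stop codon to the next in-frame stop, that the length is counted in codons excluding the stop, and that both strands are scanned in all three frames. The function names (`orf_lengths`, `orf_length_histogram`) and the uniformly random test sequence are illustrative choices, not taken from the article.

```python
import random
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orf_lengths(seq):
    """Yield the codon length of each ATG-to-stop ORF in the 3 frames of one strand."""
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if start is not None:
                    # Codons from the ATG up to, but not including, the stop.
                    yield (i - start) // 3
                start = None
            elif codon == "ATG" and start is None:
                start = i


def orf_length_histogram(seq):
    """Count ORFs by codon length over all six reading frames."""
    seq = seq.upper()
    counts = Counter(orf_lengths(seq))
    counts.update(orf_lengths(reverse_complement(seq)))
    return counts


if __name__ == "__main__":
    # Assumption: equal base frequencies, unlike the AT-rich yeast genome.
    rng = random.Random(0)
    random_dna = "".join(rng.choice("ACGT") for _ in range(100_000))
    histogram = orf_length_histogram(random_dna)
    short = sum(n for length, n in histogram.items() if length < 100)
    long_ = sum(n for length, n in histogram.items() if length >= 100)
    print(f"ORFs of 1-99 codons: {short}; ORFs of 100 or more codons: {long_}")
```

Under these assumptions, three of the 64 codons are stops, so a run of n random codons avoids a stop with probability (61/64)^n, which is below 1% for n = 100. Long ORFs are therefore rare by chance while short ones are abundant, which is why a length cutoff near 100 codons removes most noise but also discards genuine small genes.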