Late-Night Thoughts on the Sequence Annotation Problem

  1. Mark S. Boguski (1,3)
  2. (1) Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894 USA; (2) Department of Molecular Biology and Genetics, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205 USA

The reader of James Joyce’s A Portrait of the Artist as a Young Man (1992) is aided by editor’s notes that illuminate the meaning of unfamiliar words; for example, “greaves in his number” means simply “shinguards in his locker,” and “a cod” is a joke or prank. Such minimal explanatory notes are essential, especially for a novice Joyce reader; overly detailed comments, however, can mislead and can stifle the reader’s own interpretation of the work.

During the current scale-up phase of human genome sequencing, production groups have been experimenting with various types and levels of annotation. Precedents for the biological annotation of sequence records in GenBank, however, come from types of sequences that differ both qualitatively and quantitatively from those currently being produced. Historically, the first type of sequence record is that of the “functionally cloned” gene, which is often the end product of years of investigation that began with a particular biological problem in mind. There is usually a one-to-one correspondence between these records and peer-reviewed publications. The second type of sequence record might be described as the result of a population study, in which the sequences of many isolates of a particular gene are determined in order to detect and interpret variations. Examples include ribosomal genes used to study molecular phylogeny, HIV sequences used to study antigenic variation, and, most recently, copies of human genes from different individuals used to detect sequence polymorphisms for the development of genetic markers. For this second class of sequence data, a multiple alignment is often the most meaningful and appropriate type of annotation. Citations to published articles also usually accompany this type of database record. A third major class of GenBank sequences consists of single-pass expressed sequence tags …