Genomics: From Phage to Human

1.1. The Humble Beginnings …

The first genome, that of RNA bacteriophage MS2, was sequenced in 1976, in a truly heroic feat of direct determination of an RNA sequence [225]. This was followed by the genome of bacteriophage ϕX174, the first triumph of the new, rapid sequencing methods developed in the laboratories of Walter Gilbert and Fred Sanger [553,743]. These are some of the smallest known genomes with only four and ten genes, respectively. Then, in 1982, the last paper published by Sanger before he retired announced the first relatively large genome to be sequenced, that of bacteriophage λ, probably the most famous model system of classic molecular biology [742]. Phage λ has 48,502 bases of genomic DNA and ~70 known and predicted protein-coding genes and 23 RNA-coding genes. At 70 characters per line and 43 lines per page, this sequence alone would take over 16 pages of this book. However, the listing of the λ protein-coding genes (Table 1.1) fits into just two pages and definitely conveys more information. These days, it may be hard to imagine all the excitement felt by molecular biologists 20 years ago when the λ genome was finally finished. Nevertheless, even in this era of high-throughput methods, it could be instructive to look back and address several questions: (i) is the λ genome a good model of the subsequently sequenced prokaryotic and eukaryotic genomes? (ii) how accurate were the sequence itself and the original gene assignment? and (iii) how much more have we learned about the functions of λ genes in the past 20 years?
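The page estimate quoted above is easy to check. The short Python sketch below simply redoes the arithmetic with the figures given in the text (48,502 bases, 70 characters per line, 43 lines per page); the function name pages_needed is introduced here for illustration only.

```python
import math

# Figures taken from the text above.
GENOME_LENGTH = 48_502   # bases in the phage lambda genome
CHARS_PER_LINE = 70
LINES_PER_PAGE = 43


def pages_needed(n_bases, chars_per_line, lines_per_page):
    """Return (exact page count, whole pages) for a plain sequence listing."""
    chars_per_page = chars_per_line * lines_per_page
    exact = n_bases / chars_per_page
    return exact, math.ceil(exact)


exact, whole = pages_needed(GENOME_LENGTH, CHARS_PER_LINE, LINES_PER_PAGE)
print(f"{exact:.2f} pages of sequence, i.e. {whole} printed pages")
```

One page holds 70 × 43 = 3,010 characters, so the listing fills 16 pages and spills onto a 17th, consistent with "over 16 pages".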

Table 1.1

Protein-coding genes of bacteriophage λ.

The answer to the first question is definitely yes: the λ genome has many features common to the genomes of cellular life forms, particularly prokaryotes. Most of the genome consists of protein-coding genes. Adjacent genes are often transcribed in the same direction and encode proteins that have similar functions and/or interact with each other (e.g. cell lysis proteins, tail components). Adjacent genes either slightly overlap or are separated by intergenic regions of varying length, typically much shorter than the genes themselves.

To answer the second question, both the sequence and gene assignments turned out to be essentially correct. The latter may not be surprising since the λ genome was annotated by researchers who had studied the phage for years, on the basis of the entire body of knowledge amassed by that time. In contemporary genome sequencing projects, such detailed analysis by highly qualified biologists with intimate knowledge of the biology of the given organism is more an exception, rather than the norm, partly because biological information on many of the sequenced organisms is simply too scarce.

A comparison of Table 1.1 with the original paper by Sanger et al. [742] shows that there is actually not much to add to the gene annotations. The use of recently developed sophisticated gene prediction programs, such as Glimmer (see 4.1), coupled with the analysis of the regions that are conserved between λ and related bacteriophages, led to the conclusion that certain intergenic regions might contain additional protein-coding genes (marked by asterisks in Table 1.1). Unfortunately, most of these genes remain uncharacterized, and it is not even known whether they are ever expressed. It is worth noting that exactly the same doubts exist about the possible functions and/or expression of a large number of so-called “hypothetical” genes, identified in the genomes of cellular life forms by essentially the same two principal approaches (see 4.1).

When reading the Sanger paper now, 20 years after it appeared, one is struck by the absence of any analysis of protein sequences in this detailed, thorough work. Although the authors did careful computational analysis of open reading frames, particularly the likely translation starts and codon usage, the very word “homolog” is not used in the article, and there is no mention of any search of protein sequence databases, something that these days is, by default, an integral part of any genomic study. Not that protein sequence databases did not exist at the time: the first one, the Protein Identification Resource, was launched by Margaret Dayhoff, one of the great pioneers of computational biology, in 1965, long before genomics had even become conceivable [172,173]. However, reliable and rapid methods for searching this database had not yet been developed, and more generally, database search was not a part of the culture in molecular biology at the time. And for a good reason, too. Had Sanger and his coworkers performed a PIR search, even using the methods available in 2002, they would not have found anything of interest because the sequences available at that time were few and far between, and there were no homologs of phage λ proteins among them. Clearly, the time was not ripe for comparative genomics and, in a sense, for genomics itself because, as we will see throughout this book, the comparative approach is truly central to the genomic enterprise.

Revisiting the phage λ genome after 20 years, we see a completely different “genomescape”. Using the PSI-BLAST program (see 4.3), the search of the complete non-redundant protein sequence database maintained at the NCBI (National Center for Biotechnology Information, a division of the National Institutes of Health in Bethesda, Maryland, USA) for homologs of the 73 proteins listed as gene products of phage λ took about an hour on a moderately powerful computer. Another hour was spent running selected proteins through the conserved domain search using the CDD option of the NCBI’s BLAST server (see 4.4). Of course, we could have scoured the literature for descriptions of computational analyses of λ proteins instead. However, extracting the relevant information from databases, such as PubMed (see 3.7), is far from trivial because, in most cases, the papers including this information dealt with more general issues and would not have λ, let alone a particular gene, mentioned in the title or abstract. Running the searches anew was much faster and easier. Besides, sequence databases are growing daily, which may substantially affect the results of searches and might even lead to new discoveries. Perusing the results, we should note that, with a few exceptions, there are now homologs readily detectable for the phage proteins. In the majority of cases, these are proteins from other related phages (sometimes integrated as prophages into the bacterial chromosome). However, 12 λ proteins show conservation in bacteria, archaea, and eukaryotes (Table 1.2). For several of these proteins whose functions have not been studied experimentally, non-trivial functional predictions become possible.

Table 1.2

Non-trivial evolutionary connections and functional predictions for bacteriophage λ proteins.

It is remarkable that some of the more interesting computational predictions remain without experimental test. Admittedly, the visibility of molecular biology of bacteriophages as a research field has not increased since the 1970’s, and the funds have pretty much tapered off. Good examples are the Ea59 and K genes that are predicted to encode an ATPase and a metal-dependent protease, respectively. Both are clear and readily testable predictions that have been described in print, even if briefly [296,679]. However, to our knowledge, no experimental tests of these predictions have been reported so far. Interestingly, an observation has been made during these searches that actually seems to have a novel aspect to it. The Ea31 protein was shown to contain a metal-dependent nuclease domain [50]. The stop codon of the Ea31 gene overlaps the start codon of Ea59, leading to the intriguing hypothesis that the two proteins interact and form an ATP-dependent nuclease complex. We discuss sequence analysis of Ea31 in greater detail in Chapter 4 to illustrate the process of discovery in database searches. Furthermore, this is a little example of context analysis, an increasingly important direction in genome annotation, which is covered in Chapter 5. This situation is not uncommon: computational analysis of genomes keeps yielding interesting functional predictions, even years after the publication of the sequence; what is most often lacking is systematic experimental testing of these predictions.

We will come back to this dramatic rift between computational and experimental analysis of most, if not all, genomes with more numbers, but first let us step back and have a quick look into the history of genomics, which is short, but dynamic (Table 1.3). By definition, genomics requires genome sequences, and to engage in comparative genomics, one needs at least two genomes to compare. In a close analogy to the history of molecular genetics, which owes most of its early progress to bacteriophages used as model systems, comparative genomics was first practiced with the genomes of viruses. These are several orders of magnitude smaller than even the tiniest bacterial genomes and, in case a virus grows well, sequencing of viral genomes became a relatively straightforward enterprise in the early 1980’s. By 1983, six years after the beginning of the sequencing era, a considerable number of complete genomes of diverse small viruses of plants, animals, and bacteria (bacteriophages) had been amassed, and the time was ripe for the birth of comparative genomics.

Pinpointing the exact beginning of comparative genomics may be difficult. In a sense, one may say that it was born as soon as there were two genomes to compare, i.e. in 1977 when the genome of phage ϕX174 was sequenced and could be compared with the already available sequence of the RNA phage MS2. However, this was a vacuous start because the two phages had virtually nothing in common (a propos, this has not changed in 20 years: for all we know, these phage families are truly unrelated). It seems that comparative genomics had a real head start with two astonishing discoveries that caught most, if not all, virologists utterly by surprise. First, it has been shown that RNA-containing retroviruses (causative agents of certain leucoses in animals and humans and, as shown later, of AIDS) shared a conserved replicative enzyme, the reverse transcriptase, with two groups of DNA viruses, the hepadnaviruses (including the medically important hepatitis B virus) and caulimoviruses, infecting plants [847]. Second, it turned out that small RNA viruses infecting animals (picornaviruses, such as polio and foot-and-mouth disease) and those infecting plants (cowpea mosaic virus) shared not only significant sequence similarity that allowed the identification of homologous (orthologous) genes, but also, in part, the order of these genes in their genomes [7,56,335]. Subsequent systematic studies have revealed a complex network of homologous relationships within the vast classes of positive-strand RNA viruses and negative-strand RNA viruses. Although still disputed, the concept emerged that each of these classes was monophyletic, that is, probably evolved from a common ancestral virus [460]. These studies combined two elements that were crucial in defining the identity of the emerging discipline of comparative and evolutionary genomics.

Firstly, the objects of analysis were complete genomes, however small, rather than individual genes, and accordingly, the notions of conservation of gene order and gene shuffling became important. Secondly, the discoveries made through these genome comparisons were completely unexpected; there was no experimental data that would prepare researchers for the startling unity of superficially unrelated viruses.

In retrospect, it is somewhat ironic that comparative genomics had to start with virus genomes (due to the experimental contingency) because viral proteins tend to evolve extremely fast, and detection of conservation between distant viruses may be a non-trivial task, even with advanced methods of computational sequence analysis, let alone with those available in the early 1980’s. This was a challenge and perhaps a blessing in disguise. The difficulty of detecting sequence conservation among viral proteins prompted those who ventured into this area to employ approaches that later proved invaluable in comparative genomics and computational biology in general: (i) compare protein sequences, rather than nucleotide sequences directly, whenever distant relationships are involved and sensitivity is an issue; (ii) rely on multiple, rather than pairwise, comparisons; (iii) search for conserved patterns or motifs in multiple sequences; and, above all (iv) actually look at sequences (and structures whenever these are available) and think about the potential relationships in an effort to synthesize all relevant shreds of information. This practice has been dubbed, more or less pejoratively, “sequence gazing” [341]. Sure enough, sequence and structure comparisons are prone to error and, worse, to fantasy, and these dangers had been particularly grave in the early days, before the statistical foundations of computational biology had been worked out and the rules of thumb had been established through accumulated practices. There is no doubt, however, that success stories of computational prediction of gene functions have been of much greater import and have, to a large extent, determined the very feasibility of the further progress of genomics.
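Point (i) above can be illustrated with a toy example. The Python sketch below translates two short DNA sequences with the standard genetic code and compares them position by position at both levels. The two sequences are hypothetical, constructed for this illustration so that they differ only by synonymous codon changes, and the ungapped identity count is a deliberate simplification rather than a real alignment method.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions in two equal-length, ungapped strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences: same peptide, different synonymous codons.
seq_a = "ATGAAACTGTCTCGTGGTGCT"
seq_b = "ATGAAGTTAAGCAGAGGCGCA"

print("nucleotide identity:", round(identity(seq_a, seq_b), 2))
print("protein identity:   ", identity(translate(seq_a), translate(seq_b)))
```

In this constructed case, the two genes match at only 11 of 21 nucleotide positions, yet encode the identical peptide MKLSRGA. Real diverged homologs also accumulate amino acid replacements, but the redundancy of the code and the larger amino acid alphabet are among the reasons protein comparisons retain a detectable signal at distances where nucleotide comparisons do not.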

The first comparative-genomic study of a larger scale, investigating the relationships between genomes that contained >100 genes each, came in 1986 [558]. The newly sequenced genome of varicella zoster virus was carefully compared to the previously sequenced Epstein-Barr virus genome (the original Epstein-Barr genome paper [68] resembled the λ work in that no homologs were reported for any of the viral proteins because, indeed, none were to be easily identified among the sequences then available). This work, though little noticed outside virology, already included the principal elements of the comparative-genomic approach, if not the actual methods.

1.2. … and the Astonishing Progress of Genome Sequencing

Comparative genomics of cellular life forms is in a way a “by-product” of the Human Genome Project. Probably the greatest insight of the leaders of the early stages of this project was the realization that, in isolation, the human genome would be a costly but uninterpretable string of three billion or so of A’s, T’s, G’s and C’s. Only through systematic comparisons to other genomes may we hope to make sense of the text of this “Book of Life”. As far as genomics is concerned, Theodosius Dobzhansky’s famous dictum “Nothing in biology makes sense except in the light of evolution” is not some kind of evolutionist propaganda, but an entirely literal and more or less routine description of the situation. And so, in the last decade of the second millennium, the genome sequences started pouring in. Yeast chromosome III, the first respectable chunk of contiguous genome sequence [629] that became available in 1992 (quite modest, by today’s standards, just ~320,000 base pairs), generated major excitement epitomized in the title of a Nature note describing a re-analysis of the ORFs from this chromosome: “What’s in the genome?” [105]. From the analysis of this sequence and other large genome segments that started to appear in the next months, at least two notions were derived that became critical for the subsequent evolution of comparative genomics: (i) there were many more genes in the genome than anyone suspected previously on the basis of genetic or biochemical experiments; and (ii) methods of computational analysis matter—careful analysis employing multiple complementary approaches yields incomparably more information on gene functions and evolutionary relationships than any single automatic procedure.

The appearance in August 1995 of the complete genome sequence of the parasitic bacterium Haemophilus influenzae [232] ushered in the era of “real” genomics, the study of complete genomes of cellular organisms. The acceleration of genome sequencing required for this to happen was greatly facilitated by the whole-genome shotgun approach pioneered by Craig Venter, Hamilton Smith, and Leroy Hood [871]. Systematic comparative approaches were tried immediately, even before the second genome came, by using the largely finished genome of Escherichia coli [829]. Since that point, complete genomes of bacteria and archaea have been arriving at a steady rate, which seems to be accelerating in the third millennium (Figure 1.1). Starting with the second genome sequencing paper [242], reports on new genomes inevitably became comparative-genomic studies because, as we have already mentioned, that is the only way to even start understanding “what’s in the genome”.

Figure 1.1

Growth of the number of completely sequenced genomes. The data are from Table 1.4. The 2002 figure is extrapolated from the 5-month results.

By June 1, 2002, genomes of 73 species of unicellular organisms (55 bacterial species, 16 archaea, and 2 eukaryotes) were completely sequenced and available in public databases. In the three parts of Table 1.4, the completely sequenced bacterial, archaeal, and eukaryotic genomes are listed in the order of decreasing size. The largest prokaryotic genomes (Streptomyces coelicolor among bacteria, Methanosarcina acetivorans among the archaea) have been sequenced only recently, which promises many interesting discoveries yet to come.

Table 1.4

Completely sequenced genomes (as of June 1, 2002).

By the time of this writing (August 2002), the first genomes of multicellular eukaryotes, the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the thale cress Arabidopsis thaliana, the pufferfish Fugu rubripes, and Homo sapiens, had been nearly completed (let us note that the very concept of a complete genome sequence for these organisms differs from that for prokaryotes and unicellular eukaryotes). At least 100 more prokaryotic genomes and many eukaryotic genomes, including those of mouse and rat, were at different stages of completion. Beyond doubt, many more finished or nearly finished genome sequences exist in proprietary databases maintained by biotech companies, but since these cannot be freely analyzed, they do not count as far as comparative genomics is concerned.

Any list of completed genomes rapidly becomes outdated and so will Table 1.4, even as this book appears in print. Periodically updated listings of both finished and unfinished publicly funded genome sequencing projects are available at the web sites maintained at the Institute for Genomic Research (TIGR, http://www.tigr.org/tdb/mdb/mdb.html) and at the NCBI (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html). The Chicago-based Integrated Genomics Inc. maintains Genomes OnLine Database (http://wit.genomesonline.org), which lists most public as well as some private projects. In addition, web sites of the genome sequencing centers list the projects run or planned in those particular institutions (see Appendix 2).

The relative ease of 6- to 8-fold coverage sequencing as compared to finishing and genome annotation resulted in the availability of a number of incomplete genomes, which are not going to be finalized any time soon (see, for example, the web site of the Department of Energy Joint Genome Institute, http://www.jgi.doe.gov/JGI_microbial/html/index.html). These sequences are a treasure trove for someone who knows what to look for. Most of the data are available for searching through the NCBI BLAST page at http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi or through the web sites of the respective sequencing centers. A partial list of the major genome sequencing centers is available in Appendix 2. Of course, as new genome sequencing centers appear on the map, this listing is going to become obsolete, too. For updated listings of such centers, one could look at the web sites of NCBI (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links.html) or the National Human Genome Research Institute (http://www.genome.gov).
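To see why draft sequencing at 6- to 8-fold coverage still leaves finishing work to be done, one can use a standard idealized model that is not part of the discussion above: if reads land uniformly at random, the expected fraction of the genome left uncovered at coverage c is approximately e^(-c). The Python sketch below evaluates this for a hypothetical 2-Mb genome; both the genome size and the uniformity assumption are illustrative, and real projects deviate from the model because of repeats and cloning biases.

```python
import math


def expected_uncovered_fraction(coverage):
    """Idealized estimate: fraction of bases hit by no read at a given coverage.

    Assumes reads are placed uniformly at random (a Poisson approximation).
    """
    return math.exp(-coverage)


GENOME_SIZE = 2_000_000  # hypothetical genome size in bases, for illustration

for c in (6, 8):
    fraction = expected_uncovered_fraction(c)
    print(f"{c}-fold coverage: ~{fraction:.4%} uncovered, "
          f"about {fraction * GENOME_SIZE:.0f} bases")
```

Even under these optimistic assumptions, hundreds to thousands of bases remain unsequenced, scattered across many small gaps, which is why closing a genome takes disproportionately more effort than producing the draft.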

In addition to the whole-genome sequencing projects, there are many large-scale expressed sequence tags (EST) sequencing projects, aimed at collecting partial mRNA sequence data from eukaryotic organisms that have not yet made it to the list of priority targets for complete sequencing.

1.3. Basic Questions of Comparative Genomics

In the subsequent chapters of this book, we address many specific problems in comparative and evolutionary genomics. Right now, however, it makes sense to address some basic questions, the answers to which, as we believe, define the status of this research area.

How good is our current collection of genome sequences? Or, more precisely, how representative is it of the actual diversity of life forms? To address this issue, one has to superimpose the sequenced genomes over the taxonomy tree and see how densely populated the main branches are. When this is done with the prokaryotic part of the taxonomy, the result seems to be rather encouraging: the main bacterial and archaeal lineages are already represented by either a complete genome sequence or a genome project that is nearing completion (Table 1.5). However, this needs to be taken with a grain of salt because our knowledge of prokaryotic diversity is itself quite incomplete. Environmental molecular evolutionary studies indicate that the great majority of bacterial and archaeal species is uncultivable with the current methods [644]. Recent techniques aimed at growing these organisms [411] might eventually result in a real revolution in microbial genomics, but it will take years to unfold. Most of those species whose rRNA sequences are produced by environmental cloning fall within known bacterial and archaeal lineages, suggesting that we have already sampled most of the prokaryotic diversity. However, this argument is somewhat circular because we have no idea how many prokaryotes might be not only uncultivable but also unclonable, even with the most non-specific set of PCR primers that have been tried. A case in point is the recent report of a new archaeal phylum, the Nanoarchaea [362]. With these caveats, it is fair to say that, to the best of our knowledge, the diversity of prokaryotes is reasonably well covered by genome sequences, and hence, the stage is set for prokaryotic evolutionary genomics.

Table 1.5

Coverage of the main prokaryotic phyla by genome projects.

The situation with eukaryotes is different in that we seem to have a better grasp of the true eukaryotic diversity and realize that the available set of genome sequences is by no means representative (Table 1.6). While certain groups (ascomycetes, nematodes, insects, mammals) are being tackled by multiple genome projects, most of the early branching eukaryotic lineages are not represented among the sequenced genomes, and neither are most of the animal and plant phyla, including such important groups as sponges, coelenterates, and segmented (annelid) worms. Certainly, this is no reason to postpone detailed comparative-genomic analysis, but this insufficiency of genomic data needs to be taken into account when conclusions are made on eukaryotic evolution.

Table 1.6

Status of the eukaryotic genome projects.

The next question that we have to address is: Why does comparative genomics work to give us information on gene functions and evolution? The general answer is provided by the neutral theory of molecular evolution [440]. Neutral evolution is fast, as convincingly demonstrated, for example, by the rapid deterioration of pseudogene sequences. Therefore, whenever we detect sequence conservation among proteins or nucleic acids from species separated by a long span of evolution (and this, in practical terms, involves any comparison between two species because these are typically separated by millions of years, time more than sufficient for a pseudogene to change beyond recognition), we can be sure that this conservation is due to the pressure of purifying selection driven by functional constraints. To put it in even simpler terms, what is conserved in a sequence is functionally important. Furthermore, and less trivially, the conserved amino acids and nucleotides almost always perform the same or similar functions, at least in structural and biochemical terms, in homologous protein, RNA, or DNA molecules.
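The rule that conserved positions are the functionally important ones can be made concrete with a small sketch. The Python code below scores each column of a multiple alignment by the frequency of its most common residue and reports the invariant columns as candidate functional positions. The four aligned sequences are hypothetical toy data, and this simple frequency score is only one of many possible conservation measures; gaps are not treated specially.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, for each alignment column, the frequency of its commonest residue."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        _, count = Counter(column).most_common(1)[0]
        scores.append(count / len(column))
    return scores


# Hypothetical aligned fragments from four homologs (toy data).
toy_alignment = [
    "GSGKST",
    "GAGKTT",
    "GTGKSS",
    "GSGKTT",
]

scores = column_conservation(toy_alignment)
invariant = [i for i, s in enumerate(scores) if s == 1.0]
print("conservation per column:", scores)
print("invariant columns (0-based):", invariant)
```

In the toy data, columns 0, 2, and 3 are invariant while the others vary. In a real alignment of distant homologs, such invariant positions would be the first candidates for catalytic or binding residues, subject to the caveat that a handful of sequences gives only weak statistical support.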

These general concepts of molecular evolution indicate that comparative genomics is likely to be informative in principle, but they tell us nothing about the evolutionary distances at which it is expected to work. The theory would not have been violated in any way if only homologs from closely related species showed significant sequence similarity. However, it had been known already in the pre-genomic era that certain proteins are highly conserved even between vertebrates and bacteria, and the very first genome comparisons revealed deep evolutionary conservation for the majority of proteins. When state of the art methods for sequence comparison are applied, homologs from more than one distantly related species are detectable for 70-80% of the proteins encoded in any prokaryotic genome [827]. At present this fraction seems to be somewhat lower for some of the eukaryotes, but only because the taxonomic density of genome sequencing so far has been insufficient. Indeed, in the genomes of humans and mice, species that diverged from their common ancestor 80-100 million years ago, nearly all genes are conserved. These crucial facts show that genome comparisons are likely to reveal important information on the functions and evolutionary relationships of the great majority of genes in any genome.

We have already stated that genomics would not make any sense at all without the possibility of informative genome comparison. Why is this so? In principle, one could imagine that a combination of theoretical methods for deciphering a protein’s three-dimensional structure from the sequence and experimental studies would allow functional identification without recourse to evolutionary analysis. However, neither of these approaches is up to the task. Some recent progress notwithstanding, there is no hope that, in the foreseeable future, ab initio methods will become capable of correctly predicting the structure of proteins on genome scale (or on any significant scale except, possibly, for some small proteins with simple folds), let alone their functions.

As for genome-wide experimental characterization of protein functions, far-reaching studies have been conducted, such as elucidation of the phenotype of all gene knockout mutants, massive study of subcellular localization, and identification of protein-protein interaction in bulk for yeast S. cerevisiae [714,876]. However, actual determination of the biochemical activity and more so of the biological function of a protein remains a unique task, and even for model organisms such as yeast or E. coli, this goal is not in sight for all gene products.

Indeed, for the great majority of organisms whose genomes have been sequenced, only a few genes have been studied experimentally (Figure 1.2), and there is no hope for substantial progress in the near future.

Figure 1.2

The current state of annotation of some genomes. The data were derived from the original genome sequencing papers [94]. The information on experimentally characterized genes of E. coli is from the GeneProtEC and E. coli Proteome databases, the corresponding …

Even for E. coli, the workhorse of molecular genetics for the last 50 years, fewer than half of the genes have been experimentally characterized. Prior to the completion of the genome of the archaeon M. jannaschii, only four proteins had been characterized in that organism: two flagellins, RadA recombinase and the adenylate kinase (in Figure 1.2, this sector is just not visible).

The availability of the genome sequence spawned efforts to characterize other genes in these organisms, but so far these studies have made only a limited contribution. The level of characterization of eukaryotic genomes is not much higher, although post-genomic efforts are improving the understanding of the yeast and nematode proteomes (see 3.5.2).

Under these circumstances, the theory of molecular evolution and, in particular, the simple connection between evolutionary conservation and function outlined above remain the crucial theoretical underpinning and the main methodology of functional genomics. The comparative approach allows researchers to predict protein functions by transferring information from functionally characterized proteins of model organisms to their uncharacterized homologs and to delineate the functionally critical parts of protein (and RNA) molecules, such as catalytic or binding sites. Naturally, the quality of these inferences depends on the sensitivity and robustness of computational methods employed by comparative genomics. These caveats notwithstanding, we will argue that comprehensive comparative analysis of genomic sequences and the proteins they encode is an absolute prerequisite to further advances in our understanding of cell biology. Actually, we tend to believe that comparative genomics is up to something grander, namely prioritization of targets for systematic experimental studies. This approach has been partially realized in structural genomics, and we see no reason why it cannot be profitably applied in functional genomics as well. We will be quite satisfied if this book makes just a small step in this direction.

1.4. Further Reading

Doolittle RF. 1986. Of Urfs and Orfs: A primer on how to analyze derived amino acid sequences. University Science Books, San Diego.

Cairns J, Stent GS, Watson JD. 1992. Phage and the Origins of Molecular Biology. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.

Mount DW. 2000. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Chapter 1.

Koonin EV, Dolja VV. 1993. Evolution and taxonomy of positive-strand RNA viruses: implications of comparative analysis of amino acid sequences. Critical Reviews in Biochemistry and Molecular Biology 28:375–430. [PubMed: 8269709]