Computational Genomics
Bird Flu : Example of sequence statistics and phylogenetic analysis
The hemagglutinin and neuraminidase molecules of the virus are assumed to be responsible for its virulence. In this example we investigate this assumption by considering different segments of the Influenza A virus (H5N1). We also compare segments of H5N1 from several species to investigate the similarity of the virus across hosts. Next we compare the virus from one species across different regions to see how the virus has evolved. Comparisons are made by phylogenetic analysis.
SARS : Example of phylogenetic analysis
SARS (Severe Acute Respiratory Syndrome) is an illness that first appeared in the second half of 2002 in Guangdong Province (China). The disease is now known to be caused by the SARS coronavirus (SARS-CoV), a novel coronavirus. SARS was first reported in Asia (at the Vietnam French Hospital of Hanoi) in February 2003. Over the next few months, the illness spread to several countries all over the world. The origin and spread of the SARS epidemic are studied in this example. The genome sequence of SARS-CoV was completed in April 2003 by a Canadian group of researchers. It is a single-stranded RNA sequence 29751 bases long. It can be obtained from the GenBank database (accession number AY274119.3).
HIV : Example of quantitative analysis of natural selection.
In 1983 the infectious agent responsible for Acquired Immune Deficiency Syndrome (AIDS) was identified. It was called Human Immunodeficiency Virus (HIV). Despite this discovery of fundamental importance, at present there is no cure for this disease and no effective vaccine against HIV infection. The main difficulty is that neither our immune system nor any drug can cope with a virus that evolves constantly and rapidly. The genome of HIV has been sequenced hundreds of times since the eighties, so it is possible to study the differences between many individual genomes. This can help us gain a general understanding of how the virus evolves. In particular, there are specific regions of its proteins that are recognized and attacked by our immune system: these regions mostly show the signature of adaptive evolution of the changing virus. Other regions instead remain invariant, because they have important biological functions and are not involved in interactions with the immune system. In this example a genome-wide analysis of natural selection in HIV is performed.
Neanderthal (Humans) : Example of sequence comparison by genetic distance
The discovery of Neanderthal skeletons in various parts of Europe raised many questions about human origins, among them the issue of our relation to this species. Many questions about human and primate origins have now been answered by the study of the mitochondrial genome, and in particular of its hypervariable regions. These regions show high sequence variability among humans, so they are ideal for studying the relationships among individuals. There are two hypervariable regions, called HVR-I and HVR-II.
Chlamydia : Example of whole genome comparison
Human beings have multiple species of bacteria living within them. Most of these bacteria are not harmful to us and are considered beneficial symbionts. They provide us with some benefits, for example by producing chemicals that our organism needs. Some symbionts have moved permanently into the cells of their hosts, becoming completely dependent on them. As a consequence, their genomes have undergone dramatic changes, losing most of their genes. Intracellular symbiont genomes are therefore some of the smallest known, both in total size and in number of genes. In this example the genomes of symbionts of the Chlamydia family are examined. In particular we focus on two symbionts: Chlamydia trachomatis (CT) and Chlamydia pneumoniae (CP). They are both intracellular symbionts of humans, also called parasites because they do not provide any benefit to the host. They have very reduced metabolic and biosynthetic functions and a very small genome (about 1 Mb in length). Here we examine the relationship between them with the typical tools of whole genome comparison. The nucleotide sequences of both parasites can be downloaded from the GenBank database with the MATLAB function getgenbank.
Evening Element : Example of identification of regulatory sequences
Arabidopsis thaliana (also known as mouse-ear cress) is a small, weedy organism that has become the model genetic and genomic study system for plants. Its genome is approximately 120 Mb long, with five chromosomes and 29K genes. In this example we study in detail the circadian clock (i.e. the 24-hour cycle of physiological processes) of this plant. It is known that in plants (but also in humans and other animals) the circadian clock synchronizes itself with the external day-night cycle. In particular, in Arabidopsis each cell keeps track of the day-night cycle independently of all other cells. This is due to the activity of a few key proteins that allow the plant to turn on and off large groups of genes needed at different times of day and night. They do so by binding to specific regulatory sequences called transcription factor binding sites. Regulatory DNA is the sequence surrounding a gene that specifies proper transcription. It is a mosaic of short sequence motifs and semirandom DNA. These short motifs are usually found upstream of coding regions, but they can also be found downstream. They are extremely difficult to identify because they are short, because their sequence varies to some degree, and because neither the motifs nor their locations are known in advance. In this example we investigate some algorithms for finding regulatory sequences of clock-regulated genes in Arabidopsis that are activated in the evening. Here we make the simplifying assumption that the motif we are looking for is identical in every instance where it is found.
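Under the simplifying assumption of identical motif instances, one very simple search strategy is to count, for every word of a fixed length k, how many upstream sequences contain it, and to look at the most widely shared words. The Python sketch below illustrates only that idea; it is not the algorithm used in the example itself, and the function names and the toy sequences in the usage note are hypothetical.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


def most_shared_kmers(sequences, k, top=5):
    """Return the `top` k-mers present in the largest number of sequences.

    Ties are broken alphabetically so that the result is deterministic.
    """
    counts = kmer_sequence_counts(sequences, k)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]
```

For instance, calling `most_shared_kmers(upstream_sequences, 9)` on a list of upstream regions returns candidate 9-mers ranked by the number of sequences that contain them. A real analysis would also need to compare these counts with what is expected by chance in background sequence, which this sketch does not do.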
Cell Cycle : Example of gene expression profile analysis
The yeast Saccharomyces cerevisiae is a unicellular fungus found naturally on grapevines and responsible for wine making, fermenting sugars and producing alcohol. In this example we show some methods used in gene expression analysis to study its cell cycle. From being budded off from its parent cell to producing its own offspring, each yeast cell goes through a number of typical steps that also involve changes in gene expression, turning whole pathways on and off. Today the study of such phenomena is possible through microarray technology, which can measure the expression level of every gene in a cell. With gene expression data, genes can be clustered on the basis of the similarity of their expression profiles. Here we examine the expression of the entire yeast genome through two rounds of the cell cycle. The temporal expression of genes is measured by microarray at 24 time points every five hours. In detail we have the expression profile of about 6000 genes.
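To make "similarity of expression profiles" concrete, the Python sketch below computes the Pearson correlation between two profiles and groups genes greedily around seed profiles. This is a deliberately simple stand-in for the clustering methods used in the example (which are not specified here); the function names and the 0.8 threshold are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def cluster_by_correlation(profiles, threshold=0.8):
    """Greedy grouping of genes by profile similarity.

    `profiles` maps gene name -> list of expression values. Each gene joins
    the first cluster whose seed (first member) it correlates with at or
    above `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for name, profile in profiles.items():
        for cluster in clusters:
            if pearson(profiles[cluster[0]], profile) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

The result depends on the order in which genes are visited, which is one reason why hierarchical or k-means clustering is usually preferred in practice.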
Chloroplast : Example of sequence statistics and phylogenetic analysis
This demonstration investigates the relationships among plants and cyanobacteria based on nucleotide and amino acid sequences of the large subunit of the protein ribulose-1,5-bisphosphate carboxylase (Rubisco). In plants, the large subunit of Rubisco is encoded by genes in the chloroplast. This demo analyzes the characteristics of Rubisco genes in plant chloroplasts and in several cyanobacteria. Chloroplasts are in fact believed to have arisen from an endosymbiotic relationship between a eukaryotic precursor and a cyanobacterium, with the engulfed cyanobacterium becoming the chloroplast.
Nucleotide sequences for Rubisco were obtained from chloroplast and cyanobacterial genomes in the GenBank database and saved as FASTA files. The first file contains nucleotide sequences obtained from 36 photosynthetic eukaryotes. These include algae, ferns, club mosses, monocotyledons and dicotyledons (angiosperms and gymnosperms), thus representing much of the range of complexity and evolutionary history of plants. A second file contains sequences from 7 photosynthetic prokaryotes. These files must be on your local drive to run this demo.
Dauer : Example of gene expression profile analysis
It is known that gene expression in eukaryotes is regulated by transcription factors (TFs) through binding to a short piece of DNA in the upstream region. With the emergence of large-scale gene expression and genomic sequence studies, one can identify which transcription factors regulate a certain gene and predict how the gene will respond to a change of environment, based on the presence of TF binding sites in its upstream sequence. With gene expression data, genes can be clustered on the basis of the similarity of their expression profiles, and these clusters are likely to contain genes that are regulated by the same transcription factors. Searches for cis-regulatory elements can then be undertaken in the upstream regions of the clustered genes. In this example we identify the cis-regulatory elements present in the genes responsible for the dauer exit process in Caenorhabditis elegans. C. elegans is a small soil nematode found in temperate regions. The dauer is an important developmental stage in C. elegans that exhibits increased longevity, stress resistance and altered metabolism compared with normal worms. The transition from the dauer state to the non-dauer state is studied here by detecting significant changes in the expression profiles of genes between the two states. The temporal expression of genes is measured by microarray at 10 to 12 time points every one or two hours. In detail we have the expression profile of 1984 genes within 12 hours, sampled approximately once an hour.
Eyeless : Example of sequence alignment with MATLAB
This demonstration compares the eyeless gene of Drosophila melanogaster with the human gene aniridia. They are master regulatory genes producing proteins that control large cascades of other genes. Certain segments of the two genes are almost identical. The most important of these segments encodes the PAX (paired-box) domain, a sequence of 128 amino acids whose function is to bind specific sequences of DNA. Another common segment is the HOX (homeobox) domain, which is thought to be part of more than 0.2% of the total number of vertebrate genes.
Iceman : Example of phylogenetic analysis
Ötzi the Iceman is a well-preserved natural mummy of a man, frozen for about 5300 years and found in 1991 in a glacier of the Ötztal Alps, near the border between Austria and Italy. Recently researchers obtained interesting information about the mummy by examining mitochondrial DNA taken from cells of the Iceman's intestine. In particular, from phylogenetic analysis they formed some hypotheses about its mitochondrial haplogroup. A haplogroup is defined by a set of characteristic mutations in the mitochondrial genome. It can therefore be traced along a person’s maternal line and can be used to group populations by genetic features. The famous book “The Seven Daughters of Eve” by Bryan Sykes describes the classification of all modern humans into mitochondrial haplogroups and links each haplogroup to a specific prehistoric woman (“clan mothers”). In fact the branches of the mtDNA tree (composed of groups of people with related haplogroups) are continent-specific. In this example we analyze the statistical properties of the Iceman's mtDNA and perform a phylogenetic analysis of it, to investigate the relationship with modern humans from different geographical locations and to determine useful information about its haplogroup.
Jukes Cantor : Example of the Jukes-Cantor model of sequence evolution
One of the crucial tasks in computational genomics is the estimation of the genetic distance between two homologous sequences, that is, the number of substitutions that have accumulated between them since they diverged from their common ancestor. The problem is not easy, because it is not enough to count the number of positions at which the two sequences differ. That count underestimates the true genetic distance, because multiple substitutions may have occurred at the same site. A common approach to this problem is to correct the observed genetic distance between two sequences by using a probabilistic model. Several models have been developed in the past. The simplest, the well-known Jukes-Cantor model, assumes that each substitution has the same probability. In this demo we show the effect of the JC correction.
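The effect of the correction can be seen in a few lines of Python. The sketch below computes the observed proportion of differing sites p between two aligned sequences and applies the standard Jukes-Cantor formula d = -(3/4) ln(1 - (4/3) p). The function names are illustrative, and skipping gapped columns is an assumption made here for simplicity.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences.

    Columns containing a gap character ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - (4/3) * p).

    The formula is undefined for p >= 3/4 (saturation), so infinity is
    returned in that case.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small p the corrected distance is close to p, while for larger p it grows much faster: `jukes_cantor(0.125)` is about 0.137, whereas `jukes_cantor(0.5)` is about 0.824.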
Lambda Phage : Example of sequence statistics and segmentation with MATLAB
Phages are viruses that infect bacteria, and Bacteriophage lambda infects the bacterium Escherichia coli, a very well studied model system. The genome of Bacteriophage lambda was one of the first viral genomes to be completely sequenced (1982). It contains 48502 bases. The Genome repository at the NCBI contains more information about it.
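As a minimal sketch of the kind of sequence statistics this example refers to, the Python code below computes base frequencies for a whole sequence and the GC content in consecutive windows; large shifts in windowed GC content are one simple way to spot compositionally distinct segments. The function names and the window-based approach are assumptions for illustration, not the segmentation method of the MATLAB demo.

```python
from collections import Counter


def base_frequencies(seq):
    """Relative frequency of A, C, G and T (other symbols are ignored)."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content_windows(seq, window, step=None):
    """GC fraction in windows of length `window`, moved by `step` bases.

    Returns a list of (start_position, gc_fraction) pairs. By default the
    windows do not overlap; a trailing partial window is dropped.
    """
    if window < 1:
        raise ValueError("window must be positive")
    step = step or window
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        result.append((start, gc / window))
    return result
```

On a genome of this size, a call such as `gc_content_windows(genome, 1000)` would give one value per kilobase, which can then be plotted along the sequence.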
Mammoth : Example of sequence comparison through genetic distance
As shown in the human example, many questions about the origin and evolution of humans can be answered by analyzing and comparing mitochondrial DNA sequences. Here we examine the relationship between the ancient woolly mammoth and the African and Asian elephants. The woolly mammoth is an extinct species that lived in the Ice Age. The African and Asian elephants are the closest relatives of these Ice Age giants. The phylogenetic relationship of the mammoth to the African and Asian elephants has been assessed only recently, thanks to the reconstruction of several complete mitochondrial genome sequences of the woolly mammoth by multiplex polymerase chain reaction (PCR). A standard procedure in such an analysis is to take the mitochondrial sequence, remove the hypervariable regions and compare the remaining part. In this way we expect to discard the noisiest part and still have sufficient variation in the coding part of the genome. The genome sequences of the woolly mammoth and of modern African and Asian elephants can be obtained from the GenBank database with the MATLAB function getgenbank. Among all the available DNA sequences we select just three.
Saber Tooth : Example of phylogenetic analysis
The Americas were once home to a huge variety of large felines, such as the saber-tooth tigers, the scimitar-toothed tigers and the American cheetah-like cats. Only the puma and the jaguar survive today. The saber-tooth tigers, the scimitar-toothed tigers and the American cheetah-like cats were predators that became extinct about 13000 years ago, towards the end of the last Ice Age. The relationship between these ancient cats and many modern felines can be analyzed thanks to the DNA sequences available nowadays. In this example a phylogenetic analysis is performed and the genetic distances between homologous sequences of different species are calculated. We compare partial cytochrome b sequences of the domestic cat, lion, leopard, tiger, puma, cheetah, African wild cat, Chinese desert cat and European wild cat with those of three extinct species of cats: Smilodon populator (the saber-tooth tiger), Homotherium (the scimitar-toothed tiger) and M. trumani (the American cheetah-like cat).
To perform our analysis we collect all the necessary data in three FASTA files, which include the corresponding fragments of amino acid and DNA sequences of all the cats, obtained by local alignment from GenBank sequences. Additionally, there are sequences of other species, added to help root the tree and establish a proper phylogeny. These are the gray wolf, domestic dog, spotted hyena, striped hyena, black bear, brown bear and cave bear (now extinct).
Mycoplasma : Example of gene finding with MATLAB
Mycoplasmas are members of the class Mollicutes and comprise a large group of bacteria which lack a cell wall, have small genomes and a characteristically low G+C content. Mycoplasmas are of interest because they are believed to represent a minimal life form, having yielded to selective pressure to reduce genome size. The species with the smallest genome size in this class is Mycoplasma genitalium (580 kb).
Olfactory Receptors : Example of use of HMM in sequence analysis
Proteins are composed of a sequence of amino acids. These amino acids have various atomic compositions and structures that lead to different properties. The first part of this demo focuses on the property of hydrophobicity.
Olfactory receptors (OR) are part of a family of proteins that have 7 transmembrane regions, that is, they pass through the cell membrane 7 times. The interior of the cell membrane is hydrophobic, while both the exterior and the interior of the cell are hydrophilic. Therefore, the regions of the protein that pass through the membrane should contain mostly hydrophobic amino acids, while the portions outside the membrane should be mostly hydrophilic.
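One common way to look for such hydrophobic stretches is a sliding-window hydropathy profile. The Python sketch below averages Kyte-Doolittle hydropathy values over a window of residues; the choice of this particular scale and of a 19-residue default window are assumptions made for illustration, and the demo itself may use a different scale or an HMM-based approach.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of `window` consecutive residues.

    Returns one value per window start, so the result has
    len(protein) - window + 1 entries.
    """
    protein = protein.upper()
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    unknown = set(protein) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unknown residues: {sorted(unknown)}")
    scores = [KYTE_DOOLITTLE[residue] for residue in protein]
    running = sum(scores[:window])
    profile = [running / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(scores)):
        running += scores[i] - scores[i - window]
        profile.append(running / window)
    return profile
```

For a protein with 7 transmembrane regions, one would expect such a profile to show roughly seven broad peaks of high average hydropathy separated by hydrophilic stretches.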
Haemophilus : Example of sequence statistics and gene finding with MATLAB
Haemophilus influenzae is a non-motile Gram-negative coccobacillus first described in 1892 by Dr. Richard Pfeiffer during an influenza pandemic. It is generally aerobic, but can grow as a facultative anaerobe. It is responsible for a wide range of clinical diseases. The Haemophilus influenzae genome was the first genome of a free-living organism to be sequenced and assembled. It contains about 1.8 million base pairs and is estimated to have 1,740 genes.