Quantifying Heterogeneity in the HIV Genome

Two-Sample Tests for Comparing Intra-Individual Genetic Sequence Diversity between Populations

Biometrics, 2005

Consider a study of two groups of individuals, each infected with a genetically related, heterogeneous mixture of viruses, in which multiple viral sequences are sampled from each person. Based on estimates of genetic distances between pairs of aligned viral sequences within individuals, we develop four new tests to compare intra-individual genetic sequence diversity between the two groups. This problem is complicated by two levels of dependency in the data structure: (i) within an individual, any pairwise distances that share a common sequence are positively correlated; and (ii) for any two pairings of individuals that share a person, the two differences in intra-individual distances between the paired individuals are positively correlated. The first proposed test is based on the difference in mean intra-individual pairwise distances pooled over all individuals in each group, standardized by a variance estimate that corrects for the correlation structure using U-statistic theory. The second procedure is a nonparametric rank-based analog of the first test, and the third test contrasts the set of subject-specific average intra-individual pairwise distances between the groups. These tests are very easy to use and address correlation problem (i). The fourth procedure is based on a linear combination of all possible U-statistics calculated on independent, identically distributed sequence sub-datasets, over the two levels (i) and (ii) of dependencies in the data; it is more complicated than the other tests but is generally more powerful.
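The third test described above can be illustrated with a minimal Python sketch: each individual's sequences are reduced to one mean pairwise distance, and the two groups of subject-level summaries are then contrasted. The p-distance, the Welch t statistic, the function names, and the toy sequences are all assumptions made for illustration; the abstract does not specify the distance measure or the two-sample statistic used.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ (assumed distance)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def subject_diversity(seqs):
    """Mean pairwise distance among the sequences sampled from one individual."""
    return mean(p_distance(a, b) for a, b in combinations(seqs, 2))


def welch_t(xs, ys):
    """Welch two-sample t statistic (an assumed choice of contrast)."""
    se = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se


# Hypothetical toy data: three individuals per group, three sequences each.
group_a = [["ACGT", "ACGA", "ACGT"], ["ACGT", "ACGT", "ACGT"], ["ACGT", "AGGA", "ACGT"]]
group_b = [["ACGT", "TCGA", "ACCT"], ["ACGT", "ACTT", "GCGT"], ["AAAA", "ACGT", "ACGA"]]

div_a = [subject_diversity(s) for s in group_a]  # one summary per individual
div_b = [subject_diversity(s) for s in group_b]
t_stat = welch_t(div_a, div_b)
print(div_a, div_b, t_stat)
```

Because each individual contributes a single number, the correlation among pairwise distances that share a sequence (dependency (i)) no longer enters the comparison directly.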

An Efficient Test for Comparing Sequence Diversity between Two Populations

Journal of Computational Biology, 2001

We address the problem of comparing interindividual genomic sequence diversity between two populations. Although the methods are general, for concreteness we focus on comparing two human immunodeficiency virus (HIV) infected populations. From a viral isolate(s) taken from each individual in a sample of persons from each population, suppose one or multiple measurements are made on the genetic sequence of a coding region of HIV. Given a definition of genetic distance between sequences, the goal is to test if the distribution of interindividual distances differs between populations. If distances between all pairs of sequences within each group are used, then data-dependencies arising from the use of multiple sequences from individuals invalidate the use of a standard two-sample test such as the t-test. Where this problem has been recognized, a typical solution has been to apply a standard test to a reduced dataset comprised of one sequence or a consensus sequence from each patient. Disadvantages of this procedure are that the conclusion of the test depends on the choice of utilized sequences, often an arbitrary decision, and exclusion of replicate sequences from the analysis may needlessly sacrifice statistical power. We present a new test free of these drawbacks, which is based on a statistic that linearly combines all possible standard test statistics calculated from independent sequence subsamples. We describe statistical power advantages of the test and illustrate its use by application to nucleotide sequence distances measured from HIV-1 infected populations in southern Africa (GenBank accession numbers AF110959-AF110981) and North America/Europe. The test makes minimal assumptions, is maximally efficient and objective, and is broadly applicable.

INVITED REVIEW: What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity

Molecular ecology, 2006

We review commonly used population definitions under both the ecological paradigm (which emphasizes demographic cohesion) and the evolutionary paradigm (which emphasizes reproductive cohesion) and find that none are truly operational. We suggest several quantitative criteria that might be used to determine when groups of individuals are different enough to be considered 'populations'. Units for these criteria are migration rate (m) for the ecological paradigm and migrants per generation (Nm) for the evolutionary paradigm. These criteria are then evaluated by applying analytical methods to simulated genetic data for a finite island model. Under the standard parameter set that includes L = 20 High mutation (microsatellite-like) loci and samples of S = 50 individuals from each of n = 4 subpopulations, power to detect departures from panmixia was very high (∼100%; P < 0.001) even with high gene flow (Nm = 25). A new method, comparing the number of correct population assignments with the random expectation, performed as well as a multilocus contingency test and warrants further consideration. Use of Low mutation (allozyme-like) markers reduced power more than did halving S or L. Under the standard parameter set, power to detect restricted gene flow below a certain level X (H0: Nm < X) can also be high, provided that true Nm ≤ 0.5X. Developing the appropriate test criterion, however, requires assumptions about several key parameters that are difficult to estimate in most natural populations. Methods that cluster individuals without using a priori sampling information detected the true number of populations only under conditions of moderate or low gene flow (Nm ≤ 5), and power dropped sharply with smaller samples of loci and individuals.
A simple algorithm based on a multilocus contingency test of allele frequencies in pairs of samples has high power to detect the true number of populations even with Nm = 25 but requires more rigorous statistical evaluation. The ecological paradigm remains challenging for evaluations using genetic markers, because the transition from demographic dependence to independence occurs in a region of high migration where genetic methods have relatively little power. Some recent theoretical developments and continued advances in computational power provide hope that this situation may change in the future.

Gene genealogy and variance of interpopulational nucleotide differences

Genetics

A mathematical theory is developed for computing the probability that m genes sampled from one population (species) and n genes sampled from another are derived from l genes that existed at the time of population splitting. The expected time of divergence between the two most closely related genes sampled from two different populations and the time of divergence (coalescence) of all genes sampled are studied by using this theory. It is shown that the time of divergence between the two most closely related genes can be used as an approximate estimate of the time of population splitting (T) only when T ≡ t/(2N) is small, where t and N are the number of generations and the effective population size, respectively. The variance of Nei and Li's estimate (d) of the number of net nucleotide differences between two populations is also studied. It is shown that the standard error (sd) of d is larger than the mean when T is small (T ≪ 1). In this case, sd is reduced considerably by increasing sample size. When T is large (T > 1), however, a large proportion of the variance of...

Gene genealogy and variance of interpopulational nucleotide differences

Genetics, 1985

A mathematical theory is developed for computing the probability that m genes sampled from one population (species) and n genes sampled from another are derived from l genes that existed at the time of population splitting. The expected time of divergence between the two most closely related genes sampled from two different populations and the time of divergence (coalescence) of all genes sampled are studied by using this theory. It is shown that the time of divergence between the two most closely related genes can be used as an approximate estimate of the time of population splitting (T) only when T ≡ t/(2N) is small, where t and N are the number of generations and the effective population size, respectively. The variance of Nei and Li's estimate (d) of the number of net nucleotide differences between two populations is also studied. It is shown that the standard error (Sd) of d is larger than the mean when T is small (T ≪ 1). In this case, Sd is reduced...
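For orientation, the two quantities named in this abstract can be sketched in a few lines of Python. The net-difference estimate is written in its usual form, d = d_XY - (d_X + d_Y)/2, where d_XY is the mean number of differences between sequences from different populations and d_X, d_Y are the within-population means. The function names and toy sequences are hypothetical, and this sketch computes the point estimate only, not the variance studied in the paper.

```python
from itertools import combinations, product
from statistics import mean


def n_diffs(a, b):
    """Number of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def net_differences(pop_x, pop_y):
    """Net nucleotide differences: d = d_XY - (d_X + d_Y) / 2."""
    d_x = mean(n_diffs(a, b) for a, b in combinations(pop_x, 2))
    d_y = mean(n_diffs(a, b) for a, b in combinations(pop_y, 2))
    d_xy = mean(n_diffs(a, b) for a, b in product(pop_x, pop_y))
    return d_xy - (d_x + d_y) / 2


def scaled_split_time(t, n_e):
    """Scaled splitting time T = t / (2N), with t in generations and N the effective size."""
    return t / (2 * n_e)


# Hypothetical toy data: two sequences per population.
pop_x = ["AAAA", "AAAT"]
pop_y = ["TTAA", "TTAT"]
print(net_differences(pop_x, pop_y))
print(scaled_split_time(1000, 5000))
```

Subtracting the average within-population difference removes the polymorphism already present in the ancestral population, so that d reflects divergence accumulated since the split.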

Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data

Genetics, 1992

We present here a framework for the study of molecular variation within a single species. Information on DNA haplotype divergence is incorporated into an analysis of variance format, derived from a matrix of squared-distances among all pairs of haplotypes. This analysis of molecular variance (AMOVA) produces estimates of variance components and F-statistic analogs, designated here as Φ-statistics, reflecting the correlation of haplotypic diversity at different levels of hierarchical subdivision. The method is flexible enough to accommodate several alternative input matrices, corresponding to different types of molecular data, as well as different types of evolutionary assumptions, without modifying the basic structure of the analysis. The significance of the variance components and Φ-statistics is tested using a permutational approach, eliminating the normality assumption that is conventional for analysis of variance but inappropriate for molecular data. Application of AMOVA to human mitochondrial DNA haplotype data shows that population subdivisions are better resolved when some measure of molecular differences among haplotypes is introduced into the analysis. At the intraspecific level, however, the additional information provided by knowing the exact phylogenetic relations among haplotypes or by a nonlinear translation of restriction-site change into nucleotide diversity does not significantly modify the inferred population genetic structure. Monte Carlo studies show that site sampling does not fundamentally affect the significance of the molecular variance components. The AMOVA treatment is easily extended in several different directions and it constitutes a coherent and flexible framework for the statistical analysis of molecular data.
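A minimal sketch of the idea, assuming the simplest case of a single level of subdivision (haplotypes within populations): sums of squared deviations are obtained directly from the squared-distance matrix, converted to variance components, and the resulting Φ_ST is compared with values obtained after shuffling population labels. The function names, the toy distance matrix, and the number of permutations are illustrative assumptions, not the published implementation.

```python
import random
from itertools import combinations


def phi_st(d2, labels):
    """Phi_ST from squared distances d2[i][j] and population labels (one level)."""
    n = len(labels)
    pops = {}
    for i, lab in enumerate(labels):
        pops.setdefault(lab, []).append(i)
    p = len(pops)
    # Sums of squared deviations computed from pairwise squared distances.
    ssd_total = sum(d2[i][j] for i, j in combinations(range(n), 2)) / n
    ssd_within = sum(
        sum(d2[i][j] for i, j in combinations(members, 2)) / len(members)
        for members in pops.values()
    )
    ssd_among = ssd_total - ssd_within
    var_within = ssd_within / (n - p)
    # Average sample size, corrected for unequal population sizes.
    n0 = (n - sum(len(m) ** 2 for m in pops.values()) / n) / (p - 1)
    var_among = (ssd_among / (p - 1) - var_within) / n0
    return var_among / (var_among + var_within)


def permutation_p(d2, labels, n_perm=999, seed=0):
    """Permute individuals across populations; no normality assumption is needed."""
    rng = random.Random(seed)
    observed = phi_st(d2, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if phi_st(d2, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: four haplotypes placed at positions 0, 1, 10, 11.
positions = [0, 1, 10, 11]
d2 = [[(a - b) ** 2 for b in positions] for a in positions]
labels = ["A", "A", "B", "B"]
observed, p_value = permutation_p(d2, labels)
print(observed, p_value)
```

With only four haplotypes the permutation distribution is very coarse, so the p-value here is purely illustrative; real applications use many more samples per population.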

Molecular Population Genetics

Genetics, 2017

Molecular population genetics aims to explain genetic variation and molecular evolution from population genetics principles. The field was born 50 years ago with the first measures of genetic variation in allozyme loci, continued with the nucleotide sequencing era, and is currently in the era of population genomics. During this period, molecular population genetics has been revolutionized by progress in data acquisition and theoretical developments. The conceptual elegance of the neutral theory of molecular evolution or the footprint carved by natural selection on the patterns of genetic variation are two examples of the vast number of inspiring findings of population genetics research. Since the inception of the field, Drosophila has been the prominent model species: molecular variation in populations was first described in Drosophila and most of the population genetics hypotheses were tested in Drosophila species. In this review, we describe the main concepts, methods, and landmar...

Population Genetics in the Genomic Era

Studies in Population Genetics, 2012

In addition, the genomic approach has greatly facilitated the identification of disease- and trait-associated genes. Genome-wide association studies (GWAS), which genotype millions of SNPs in thousands of individuals, have become a standard method in disease gene discovery in the past several years. However, the common variants identified by GWAS account for only a small fraction of the heritability and thus fail to explain the majority of phenotypic variance in the population. Therefore, as an alternative to the common disease-common variant (CDCV) hypothesis, several new hypotheses have been proposed. In this chapter, we will focus only on those topics concerned with population genomics, i.e. methods, statistics and analysis based on high-density genome-wide data (either genotyping data or sequencing data), so the scope differs from that of traditional population genetic studies relying on a single locus or sparse loci.
2. Variation, recombination, haplotypes and inference of population parameters
2.1 Overview of genome-wide high-density data
Based on the current technologies and features, genome-wide data can be roughly classified into genotyping data and sequencing data. DNA genotyping is the process of determining the status of DNA using biological assays and comparing it with known sequences. It is used either to track the alleles an individual inherited from his/her parents, or to reveal differentiation between individuals and populations. Because most SNPs were discovered in a small set of samples, genotyping data have a high proportion of SNPs with intermediate allele frequency. This ascertainment bias is likely to affect all statistics based on allele frequencies[1]. DNA sequencing, on the other hand, includes several methods and technologies to determine the order of the nucleotides in DNA. The high demand for sequencing data has promoted the development of low-cost high-throughput sequencing technologies (also referred to as next-generation sequencing) that parallelize the sequencing process by producing thousands or millions of sequences at once[2]. These low-cost high-throughput technologies, including 454 Life Sciences (Roche) sequencing, Illumina Solexa sequencing, and Applied Biosystems SOLiD sequencing, will eventually make individual genomes affordable and accessible, ushering in the era of individual genomics.
2.2 Genetic variations in human genome
Genetic variations refer to any genetic differences among individuals within one population or species, which provide the genetic basis of evolution. The nucleotide difference between individuals is estimated to be about 0.1%[3]; for a genome of roughly 3 billion base pairs, this means that there are about 3 million nucleotide differences between two unrelated individuals. The genetic variations in the human genome can be classified into single nucleotide polymorphism (SNP), short insertion and deletion (indel), copy number variation (CNV), variable number tandem repeat (VNTR: including microsatellite and minisatellite), haplotype (including haplogroup), epigenetic variation, and so on[4]. Among these, a SNP is a variation of a single nucleotide in the sequence, generally caused by a single mutation. It is estimated that about 30 million SNPs exist in human genomes, which makes them the most common type of genetic variation.

Locus-specific genetic diversity between human populations: An analysis of the literature

American Journal of Human Biology, 2003

The debate over classification of the human species according to racial or continental lines has involved reports on genetic differences in allele frequencies of a number of loci with important biomedical functions. Such differences are in contrast with the fact that, for human beings, intrapopulation genetic diversity is larger than that seen between populations. In an attempt to address the hypothesis that certain genes show high interpopulation diversity due to selective pressure, the literature was surveyed to quantify such diversity using Wright's Fst statistic. The gene-specific Fst values were then compared to pairwise population values of Fst taken over a large number of genes, which presumably reflect mostly neutral mechanisms of genetic diversity such as drift. The results showed that the majority of pairwise population values of Fst for over 30 genes of biomedical significance were either below or within the expected limits of Fst based on published values. These results do not support the idea that positive or diversifying natural selection plays an important role in increasing genetic diversity, even in genes that might be expected to be subject to selection pressure. Balancing selection, whereby the degree of genetic diversity is actually lower than that expected, appears to occur more frequently for these genes. The fact that allele frequency differences between populations might be "statistically significant" does not therefore necessarily imply a degree of genetic diversity greater than would be expected due to nonselective mechanisms.
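To make the pairwise Fst comparison concrete, here is a minimal Python sketch for a single biallelic locus, using the textbook heterozygosity form Fst = (H_T - H_S) / H_T with equally weighted populations. The abstract does not state which estimator the survey used, so the formula choice, the function name, and the example allele frequencies are assumptions for illustration only.

```python
from statistics import mean


def fst_biallelic(freqs):
    """Fst = (H_T - H_S) / H_T for one biallelic locus.

    freqs: frequency of one allele in each population (equal weights assumed).
    """
    h_s = mean(2 * p * (1 - p) for p in freqs)  # mean within-population heterozygosity
    p_bar = mean(freqs)
    h_t = 2 * p_bar * (1 - p_bar)  # heterozygosity of the pooled population
    if h_t == 0:
        return 0.0  # locus is fixed everywhere, so there is no diversity to partition
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two populations.
print(fst_biallelic([0.2, 0.8]))    # strongly differentiated pair
print(fst_biallelic([0.45, 0.55]))  # weakly differentiated pair
```

A gene-specific value computed this way could then be set against the range of Fst values observed over many presumably neutral loci for the same pair of populations, which is the comparison the abstract describes.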