Abundant Raw Material for Cis-Regulatory Evolution in Humans (original) (raw)

Journal Article

,

Search for other works by this author on:

Search for other works by this author on:

Published:

01 November 2002

Navbar Search Filter Mobile Enter search term Search

Abstract

Changes in gene expression and regulation—due in particular to the evolution of _cis_-regulatory DNA sequences—may underlie many evolutionary changes in phenotypes, yet little is known about the distribution of such variation in populations. We present in this study the first survey of experimentally validated functional _cis_-regulatory polymorphism. These data are derived from more than 140 polymorphisms involved in the regulation of 107 genes in Homo sapiens, the eukaryote species with the most available data. We find that functional _cis_-regulatory variation is widespread in the human genome and that the consequent variation in gene expression is twofold or greater for 63% of the genes surveyed. Transcription factor-DNA interactions are highly polymorphic, and regulatory interactions have been gained and lost within human populations. On average, humans are heterozygous at more functional _cis_-regulatory sites (>16,000) than at amino acid positions (<13,000), in part because of an overrepresentation among the former in multiallelic tandem repeat variation, especially (AC)n dinucleotide microsatellites. The role of microsatellites in gene expression variation may provide a larger store of heritable phenotypic variation, and a more rapid mutational input of such variation, than has been realized. Finally, we outline the distinctive consequences of _cis_-regulatory variation for the genotype-phenotype relationship, including ubiquitous epistasis and genotype-by-environment interactions, as well as underappreciated modes of pleiotropy and overdominance. Ordinary small-scale mutations contribute to pervasive variation in transcription rates and consequently to patterns of human phenotypic variation.

Introduction

Variation in noncoding _cis_-regulatory DNA sequences has been advanced as a major component of the genetic basis for phenotypic evolution (Britten and Davidson 1969 ; King and Wilson 1975 ; Stern 2000 ; Tautz 2000 ; Carroll, Grenier, and Weatherbee 2001 ; Davidson 2001 ; Stone and Wray 2001 ; Enard et al. 2002 ). This view is supported by interspecific analyses documenting dramatic changes in patterns of gene expression coupled with functional and structural conservation of proteins. Models of regulatory rewiring and co-option have proliferated to explain the evolution of developmental and morphological diversity. However, although intraspecific protein sequence polymorphism has been studied for decades, we lack even basic information about the level and nature of functional _cis_-regulatory variation within populations. Heritable variation in gene expression has been described (Damerval et al. 1994 ; Cavalieri, Townsend, and Hartl 2000 ; Jin et al. 2001 ; Brem et al. 2002 ), but the genetic basis of this variation represents, with very few exceptions (Stam and Laurie 1996 ; Crawford, Segal, and Barnett 1999 ; Schulte et al. 2000 ), a major lacuna in the literature of molecular evolution. We have turned to the literature of human medical genetics to fill this gap. Characterization of functional _cis_-regulatory variation, its effects on transcription, and its consequent influence on phenotypes is critical for a complete understanding of the genetic basis of phenotypic evolution. Molecular evolutionary studies of regulatory polymorphisms represent a promising avenue for the synthesis of the quantitative models of molecular population genetics and for the mechanistic and descriptive studies of developmental and macroevolutionary phenomena.

Variation in coding sequence is classified as nonsynonymous and synonymous, reflecting nucleotide variants with and without an effect on phenotype at the level of protein primary structure. Currently, however, we lack the means to distinguish functional regulatory variants in noncoding DNA from nucleotide sequences alone. Although noncoding variation has been implicated as the genetic basis for phenotypic variation by means of QTL mapping and association studies (e.g., Sucena and Stern 2000 ), such methods are typically unable to attribute effects or modes of action to specific nucleotides because of potential complications from linkage disequilibrium. We therefore assembled from the literature a data set of human _cis_-regulatory polymorphisms meeting stringent experimental criteria for functionality. Specifically, the functional import of each polymorphism has been validated by allele-specific reporter construct assays in cell culture. From this data set, we estimate the basic characteristics of functional _cis_-regulatory variation and describe the implications of this variation for regulatory and phenotypic evolution.

Materials and Methods

Literature Survey

We searched the literature (through December 2001) for articles relating to regulatory polymorphisms in humans. We used data from these articles to construct several nested data sets relating to sequence variation, reporter assays, allele frequencies, phenotypic associations, and protein binding. The final data sets, and the full list of the more than 400 references from which they are extracted, are available as supplementary material. The study of _cis_-regulatory variants introduces some unique challenges with consequences for our analyses. Unlike coding sequence, _cis_-regulatory function is inherently context dependent. In particular, the functional consequences of a _cis_-regulatory polymorphism depend on cell type, temperature, the distribution of exogenous inducers, and covariation at other sites in the genome. Moreover, experimental methods used to identify functional variants introduce additional variation; for example, a variant may have no effect on transcription unless the construct includes introns or downstream elements that physically interact with it to transduce the transcriptional output. The spatial extent of relevant DNA is not defined, and important regulatory elements can lie hundreds of kilobases from transcription start sites. The consequence is that whereas coding variants are readily classified as synonymous or nonsynonymous, assignment of functional consequences to a _cis_-regulatory polymorphism requires analysis of the variants over a vast multidimensional range of conditions. Standardized, high-throughput methods will fail to detect many, if not most, functional variants. We therefore view literature survey, in which each variant has been studied in depth by specialists, as the best approach presently available to study the dynamics of functional _cis_-regulatory variants. Our reliance on a sample of convenience introduces certain deviations from a perfectly representative sample of variants from the genome. Any study of a selected subset of the genome, including recent large-scale studies of coding sequence variants in medically interesting genes (e.g., Cambien et al. 1999 ; Cargill et al. 1999 ; Halushka et al. 1999 ), relies on a nonrandom sample. We discuss the consequences of our sampling strategy below. Our survey stands in the tradition of other early analyses of human genetic samples of convenience that, despite their lack of strict random sampling, expanded our understanding of human genetic variation and evolution, including population subdivision (Lewontin 1972 ), molecular evolution (King and Wilson 1975 ), and nucleotide diversity (Li and Sadler 1991 ).

Inclusion Criteria

The primary data set consists of protein coding loci whose transcription is influenced by _cis_-regulatory polymorphisms. These genes meet two inclusion criteria. First, sequence variants have a statistically significant effect on transcription rate in the context of allelic reporter constructs transfected into a physiologically relevant cell line. Second, the responsible variants have rare allele frequencies greater than 1% in a human population. Genes are included if a reporter construct experiment shows allelic differences in transcription rate that the authors of the study characterize as statistically significant, even if other experiments show no significant differences. This approach is necessitated by the pervasive context-dependence of regulatory DNA function. The 1% frequency criterion eliminates a large number of rare variants whose pathological consequences brought their carriers to clinical attention. Functional noncoding variants that affect RNA stability and localization, splicing, and translational initiation or efficiency, though important components of the functional noncoding genome, are excluded (e.g., UTR variants in LPA, TYMS, and F12). In addition, we have excluded somatically unstable repeats (“dynamic mutation”) that affect transcription (e.g., repeats near SIX5, FMR1, CSTB, and FRDA). These noncoding variants may cause disease in their hyperexpanded forms, but the effect of ordinary, nonpathological variation in repeat number at these loci is unknown. Our experimental criterion excludes many likely functional variants for which the appropriate experiments have not yet been performed. For example, several polymorphic nucleotide variants are known to affect the binding of transcription factors and have also been implicated in variation in transcription rate by way of fine-scale linkage mapping and association studies, but in the absence of reporter construct assays these genes (e.g., COL1A1 and VWF) are omitted from our data set. The primary data set includes 107 genes, of which 106 are separately named genes; the GCK gene is counted twice because it has two tissue-specific promoters and first exons, separated by 16 kb, each with functional _cis_-regulatory polymorphisms.

Fold-Difference Estimates

We tabulated the fold-difference in transcription level between alleles at each locus. These data represent the fold-difference observed in the experimental design giving the largest average difference between alleles. This maximum difference may be justified as likely the most physiologically important situation, but its use is also a practical necessity. Because expression level in vivo is a function of cell type, inducer concentration, and genetic background, and because experimental study introduces the additional variable of reporter construct design, there is no way to characterize a “typical” expression difference between alleles at a locus. In the extreme case, for instance, alleles will show no difference in transcription rate when present in cell types that do not express the gene. Use of the maximum difference introduces two opposing ascertainment biases. First, the statistical power to detect small differences is a function of sample size (the number of independent transfection experiments) and the magnitude of the experimental error, so small differences between alleles may be underrepresented relative to their true frequency. Second, many variants have been studied in only one or a few cell types and conditions, and none has been observed in embryos, so for most loci the true maximum allelic difference in transcription rate may not have been observed, leading to an underrepresentation of large differences.

Corroborative Association Studies

We sought corroboration of functionality for the experimentally validated variants by identifying published associations between the variants and expected phenotypes. At the biochemical level, we tabulated statistically significant associations between the functional _cis_-regulatory variants and in vivo measurement of gene expression. At the organismal phenotype level, we tabulated statistically significant associations between the variants and any phenotype that the authors of the study expect to be associated with the expression of the gene. The phenotypes include morphometric, physiometric, and psychometric traits, disease risk and outcome, and pathogen susceptibility and transmission risk.

Functional Variants

Data sets of mutation types and positions were assembled from 101 genes in the 107-gene primary data set. Variants are included in these secondary data sets only if the specific nucleotide variant is implicated in allelic variation in transcription by reporter construct experiments. Consequently, six of the 107 genes in the primary data set are not represented because allelic variation is experimentally attributable only to haplotypic variation; which of the varying nucleotides and how many of then contribute to the transcription rate variation are unknown for these genes. Other genes are represented by multiple, individually tested functional variants. The data set of mutation types includes 144 observations. Biallelic tandem repeat polymorphisms are included in the count of indels, whereas multiallelic tandem repeat polymorphisms are classified separately as variable number tandem repeats (VNTRs). We separated single nucleotide polymorphisms (SNPs) into categories without regard to strand orientation (for example, a T/G SNP is counted together with A/C SNPs because they represent the same base-pair exchange, differing only in strand). This approach allows comparison with published tallies of genomic SNP types and is additionally justified by the fact that _cis_-regulatory DNA function is mediated by the structure of double-stranded DNA, i.e., there is no _cis_-regulatory “sense” strand. The data set of positions of variants relative to transcription start sites includes 141 observations from 100 genes, after excluding genes whose starts of transcription are inadequately mapped. For first exons with variable start sites (i.e., multiple mapped starts within a 100-bp region), we use the 5′-most major start site. For variants occupying multiple positions (e.g., indels and VNTRs), we calculated distance from the nucleotide of the variant that is nearest to start of transcription.

Protein-DNA Interactions

We assembled a data set of functional nucleotide variants (i.e., those in the mutation types data set) whose interactions with transcription factors have been experimentally determined to show allelic variation. The experimental criterion for inclusion in this data set is the demonstration of differential transcription factor binding in an electromobility shift assay using each allele as a probe for binding to proteins from nuclear extracts from relevant cell types. We use this experimental criterion because although transcription-factor binding sites can be predicted on the basis of sequence similarity to a consensus sequence, binding affinity often varies within the consensus, many consensuses are ill-defined, and many binding sites are as yet wholly uncharacterized. We have not counted inferred binding polymorphisms without experimental verification; for example, several functional variants that alter TATAA boxes (adjacent to the CYP2A6, HSD17B2, and UGT1A1 genes) are excluded from this tally because differential protein binding has not been demonstrated by gel shift assays.

Frequency Data

We collected published allele frequency data for 129 of the functional variants. These data include observations of 241,008 chromosomes, a mean of 1,868 per variant (median 1,049). Among the sets of variants showing complete haplotypic association, we counted only a single representative site; however, in cases of incomplete or unknown linkage disequilibrium, we count each site individually. Some functional variants are unrepresented in the frequency data set because population samples are not available, although in each case the data are sufficient to indicate that the functional variants meet the 1% rare allele frequency threshold. Frequency data are derived from random populations or from the control populations of clinical studies; the homogeneity and geographic origins of the populations vary among the studies. As a representative statistic, we have used the midpoint of the range (over populations) of rare allele frequencies for each locus. For example, we have frequency data for APOE −491 from nine populations. The rare allele, −491T, has a minimum frequency of 0.113 (among Finns) and a maximum frequency of 0.305 (among African-Americans). The midpoint of this range is 0.209. For each population, we estimated the expected heterozygosity as one minus the sum of the squared allele frequencies. We then calculated the midpoint of the range of heterozygosities as a representative statistic for each variant. Note that the use of the minimum heterozygosities and frequencies for each variant (including zeroes for the 10 private polymorphisms) has little effect on our overall conclusions.

Results and Discussion

Abundant Functional _cis_-Regulatory Variation

Experimentally verified functional _cis_-regulatory polymorphisms influence the rate of transcription from 107 genes spread over 20 autosomes and the X chromosome (fig. 1 ). This number represents 1% of all officially named human genes (Locuslink, December 19, 2001). Because these named genes include all genes that have been subject to even minimal study, 1% represents a firm, minimum estimate of the fraction of genes with functional _cis_-regulatory polymorphism. Genes from a range of functional categories are represented, including metabolic and regulatory enzymes, intercellular transporters, transcription factors, signal transducing receptors and their ligands, proteinase inhibitors, extracellular matrix proteins, components of the immune system, and cell adhesion molecules.

The functional variants have large effects on transcription (fig. 2_A_ ). In reporter construct assay experiments, 63% of the 107 genes have allelic differences of twofold or greater in their rates of transcription, and 20-fold differences are not uncommon.

To corroborate the experimental evidence for functionality of these variants in cell culture conditions, we sought support from studies documenting associations in vivo between the polymorphisms in their native chromosomal context and proximal biochemical phenotypes. We found such published support for 59% of the 107 genes. We next sought corroborative evidence from associations with predicted higher level phenotypes, including morphometric traits and disease susceptibilities. We obtained such evidence for 71% of the genes. At least one of the two lines of corroboration was found for 82% of the genes, and both lines were found for 47%. In most cases where corroboration is absent, the relevant studies have not yet been undertaken or published.

Characteristics of Functional Variants

Figure 2_B_ shows the spectrum of mutation types among 144 functional variants (see Materials and Methods for an explanation of the sample size, and see Supplementary Material for a detailed list). The spectrum is largely typical of segregating human genetic variation and of substitutions fixed in the human lineage, including the ratio of SNPs to biallelic indels, and the pattern of base pair exchanges among the SNPs (fig. 2_C_ ) (Taillon-Miller et al. 1998 ; Cambien et al. 1999 ; Nachman and Crowell 2000 ; Taillon-Miller and Kwok 2000 ; Venter et al. 2001 ; Yu et al. 2001 ). Ordinary small-scale mutations are thus major contributors to heritable variation in transcription (Stone and Wray 2001 ).

The spectrum of mutation types is atypical, however, in its proportion of multiallelic variants (those with more than two alleles, such as polymorphic microsatellites). At 20% (29/144), the proportion of multiallelic variants among functional _cis_-regulatory variants is higher than the proportion found in surveys of random human variation, typically around 5%, though precise numbers are scarce (Taillon-Miller et al. 1998 ; Cambien et al. 1999 ). Nineteen of the functional variants (13%) involve VNTRs with repeat unit length less than 10 nucleotides; 12 involve dinucleotide repeats. Among these dinucleotide microsatellites, (AC)n-type repeats are overrepresented (two-tailed binomial P = 0.006); these constitute 50% of the dinucleotide microsatellites genome wide (International Human Genome Sequencing Consortium 2001 ) but 92% (11/12) in our data set.

The first 500 bp upstream of the transcription start sites contain 58.9% (83/141; see Materials and Methods) of the variants, but a substantial fraction are found further afield; 12.8% are more than 1 kb upstream, and another 12.8% fall 3′ to the start of transcription. Two variants (1.4%) occur more than 10 kb upstream of their start sites. Given the ascertainment bias favoring discovery of variants in the immediate 5′ flanking sequence, these data indicate the broad spatial distribution of variation influencing transcription, as expected, given the broad distribution of functional regulatory elements (e.g., Carey and Smale 2000 , pp. 61–62).

Polymorphic DNA–Protein Interactions

Allelic variation in transcription rate may be due to allelic variation in affinity for transcription factor binding. We sought published biochemical evidence for such variation for the functional variants in our data set.

For 51 variants, we inferred whether the allele with experimentally determined higher affinity for transcription factor binding is associated with the activation of transcription or its repression. We found 26 instances of polymorphic activator binding and 25 instances of polymorphic repressor binding. An additional nine loci showed transcription factor switching, with each allele having higher affinity for a different transcription factor.

Our inference of activation or repression is based on the evidence from reporter construct assays in cell lines and ignores the context-dependence of _cis_-regulatory function. For example, the influence of a polymorphic Sp1-Sp3 binding site in the CD14 promoter depends on the ratio of Sp1 to Sp3 in the nucleus; because this ratio varies among cell types, protein binding to the site may result in either repression or activation (Levan et al. 2001 ). Nevertheless, our inference may be justified by noting that experimental studies typically focus on the cell types in which expression of the gene is thought to be of the greatest physiological consequence, and in most cases only a single effect, activation or repression, is documented for each polymorphic interaction.

We inferred ancestral states for 20 variants for which both experimental protein binding data and nonhuman primate sequences are available. Of these, the derived state is a gain of transcription factor binding in seven cases and a loss in 10 cases. The other three cases involve transcription factor switching. Two of the seven gains are human-specific expansions of tandem repeat sequences (IGF2/INS and TH). Two functional variants from the HLA-DQB1 locus in the MHC complex are transspecific polymorphisms, i.e., both alleles segregate in both humans and chimpanzees. Considering more distant out-groups, we record these changes as one gain and one loss of transcription factor binding in the human-chimpanzee common ancestor.

Transcription factors from a diversity of structural families are involved in polymorphic binding, and many, including USF, Sp1, NFκB, GATA, and OCT, are involved in polymorphic binding to the _cis_-regulatory regions of more than one gene. Sp1 alone is involved in seven experimentally confirmed polymorphic functional DNA-protein interactions in our data set.

Five categories of regulatory change are represented in our small sample of human polymorphisms: gains and losses of repressor binding and of activator binding, and transcription factor switching. If we think of gene regulation as a network of linkages between transcription factors and _cis_-regulatory DNA, these data illustrate the extent to which the structure of the network is variable even within populations.

Population Genetics

The effects of _cis_-regulatory variation in human populations depend both on the number of segregating polymorphisms and on the heterozygosities of these polymorphisms. We collected frequency data from the literature for 129 functional variants from our 107 genes (table 1 and fig. 3_A_ ; see Materials and Methods). The distribution of rare allele frequencies is fairly uniform (fig. 3_B_ ) and thus deviates substantially, with an excess of intermediate frequency variants, from the neutral expectation (Hartl and Clark 1997 , pp. 294–304) and from observed frequencies for both coding and other noncoding SNPs, which are skewed toward low-frequency alleles in humans (Cargill et al. 1999 ; Halushka et al. 1999 ; Zwick, Cutler, and Chakravarti 2000 ; Stephens et al. 2001 ). The observed distribution is attributable, in part, to an ascertainment bias—clinical studies of common health factors may ignore rare variants and consequently experimental validation of functional effects will focus on intermediate-frequency variants. Nevertheless, the unusual shape of the distribution may suggest that natural selection contributes to _cis_-regulatory polymorphism in humans. Directional selection with geographically heterogeneous selection regimes accounts for variation in the _cis_-regulatory region of the FY locus (Hamblin and Di Rienzo 2000 ) and possibly the CCR5 locus as well (Schliekelman, Garner, and Slatkin 2001 ). Overdominant balancing selection may maintain variation in the _cis_-regulation of MHC genes (Guardiola et al. 1996 ), whereas variation at UGT1A (Beutler, Gelbart, and Demina 1998 ) and TNF (Wilson et al. 1997 ) may be maintained by balancing selection because of antagonistic pleiotropy. Natural selection operating according to Neel's (1962) “thrifty genotype” hypothesis has been invoked to explain variation at the AGT (Inoue et al. 1997 ), CAPN10 (Baier et al. 2000 ), and INS (McCarthy 1998 ) loci. Although these models are at present little more than verbal scenarios (with the exception of the FY and CCR5 variants), distinctive properties of _cis_-regulatory variants discussed below, such as tissue-type overdominance, genotype-by-environment interactions, and interactions between epistasis and linkage, may make _cis_-regulatory polymorphisms especially prone to maintenance by natural selection.

We have inferred ancestral states for 21 biallelic variants for which we have frequency data. Derived states have higher frequencies than do ancestral states at seven of these loci, at least in some populations. There are 11 additional variants for which ancestral states are unknown but that have frequencies on either side of 0.5 in different populations. These data illustrate the fact that functional _cis_-regulatory polymorphisms are contributors not only to transient human variation but likely also to divergence of humans from our ancestors.

Population Genomics

Our data permit some preliminary and cautious estimates of the extent of _cis_-regulatory variation in the genome. Because of the way variants are sampled in our data set and because of the inherent context-dependence of the _cis_-regulatory function, we cannot address the usual molecular population genomics parameters such as π, the per-site heterozygosity at functional _cis_-regulatory sites, or S, the number of segregating functional _cis_-regulatory sites in a sample of specified size. We concentrate instead on the total number of segregating functional _cis_-regulatory polymorphisms (i.e., variants with a rare allele population frequency greater than 1%) genome wide and on the expected number of heterozygous sites genome wide, considering only segregating functional _cis_-regulatory polymorphisms.

A recent analysis of upstream sequences from 180 loci (Stephens et al. 2001 ) found that the total frequency of SNPs in the proximal 850 bp of the 5′ flanking sequence averages 5.73/kb (this number is basically a large sample estimate of _cis_-regulatory S because it includes rare variants with frequencies less than 1%). Genes in our data set average 0.94 functional SNPs per kilobase in the proximal 850 bp 5′ to the start of transcription. (Independently functional nucleotides have been both identified and mapped in relation to the start of transcription in 100 of 107 genes; we thus calculate on the basis of 80 SNPs in a sample of 85 kb, i.e., 100 genes times 850 bp per gene.) A crude comparison suggests that, if our genes are typical, more than 16.4% of proximal promoter SNPs influence transcription. Although not directly comparable, large sample estimates of the percentage of human coding sequence variants that alter an amino acid range from 44% to 56% (Cargill et al. 1999 ; Halushka et al. 1999 ; Stephens et al. 2001 ; Venter et al. 2001 ).

Our estimate of 0.94 functional SNPs per kilobase is affected by two ascertainment biases, one among genes and one among nucleotide variants. First, our genes may be unrepresentative; they were included specifically because they contain functional variants. We suspect that this bias is minor: our genes are drawn from a wide range of functional categories and chromosomes, and in the vast majority of the studies from which our data are drawn, genes were selected with no a priori expectation of functional _cis_-regulatory variation. In most cases, the polymorphisms were identified by medical researchers who first selected genes known from model system biology as candidate genes for medical conditions (for example, apolipoproteins as candidate genes for cardiovascular phenotypes). The genes selected were then searched for polymorphisms, some of which were experimentally tested for function. Moreover, genes with functional _cis_-regulatory variants are included in our data set, irrespective of whether the variants are SNPs or whether they occur in the proximal 850 bp. Nevertheless, bias among genes, in particular a publication bias favoring genes with functional polymorphisms, may result in an inflation of the density of functional SNPs in our data set. A second ascertainment bias, bias among nucleotide variants, affects the estimate in the opposite way, by undercounting functional variants. Specifically, our estimate assumes that every functional variant in the proximal 850 bp of each of the genes in our data set has been identified. In fact, in many of the genes, little or none of the proximal region has yet been studied. Only a fraction of the identified variants have been tested experimentally, and of those many have not been tested under relevant conditions. On balance, we expect that our estimate of the number of functional _cis_-regulatory SNPs will prove low. As shown below, our primary conclusions regarding the distribution of human _cis_-regulatory variation are largely insensitive to our assumptions.

Given 0.94 functional SNPs/kb in the proximal promoter of each gene, and then by assuming an average heterozygosity for biallelic variants derived from our analysis (0.313; see table 1 ) and an estimate of 30,000 genes in the genome, we can extrapolate that on average a human is heterozygous for 7,500 functional SNPs in the proximal 850 bp of the 5′ flanking sequence. If we consider the 9.9% of biallelic variants in our data set that are not SNPs and the 22.5% of biallelic variants that fall outside the proximal 850 bp, we arrive at an estimate of about 10,700 functional biallelic _cis_-regulatory variants heterozygous in a typical human (table 2 ).

The preceding extrapolations are based on the best-studied variants, proximal promoter SNPs. We can also use the data set without breaking it down by spatial location. We found a total of 115 biallelic variants in 101 genes for which we have the appropriate data. If we extrapolate genewise, 1.1386 variants/gene times 30,000 genes times an average heterozygosity of 0.313 again yields an estimate of 10,700 functional biallelic _cis_-regulatory variants heterozygous in a typical human.

A Functional Role for Microsatellites

Multiallelic variants comprise 20% (29/144) of the variants in our data set, and among these, (AC)n microsatellites are overrepresented. Because these microsatellites have usually been considered nonfunctional neutral markers, we discount investigator bias as an explanation. We suggest instead that (AC)n repeats are major contributors to transcriptional regulation and hence to phenotypic variation in humans. Evidence for their role in transcriptional regulation is accumulating. In reporter constructs, (AC)n repeat number variation alone influences rates of transcription (Hamada et al. 1984 ), and their influence in the context of specific genomic sequences has been demonstrated experimentally for each of the 11 examples in our data set (table 3 ). At least one unidentified human transcription factor binds specifically to (AC)n sequences (Epplen, Kyas, and Maueler 1996 ; Shimajiri et al. 1999 ), and (AC)n sequences may influence transcription rates by their effects on DNA conformation even in the absence of specific protein binding (Naylor and Clark 1990 ; Wolfl et al. 1996 ; Rothenburg et al. 2001 ).

We can extrapolate from our data to the number of functional, polymorphic (AC)n repeats in the genome. The prevalence of (AC)n repeat elements is approximately 27.7/Mb, for more than 88,000 in the 3.2 Gb human genome, on the basis of stringent criteria which do not count short or imperfect repeats (International Human Genome Sequencing Consortium 2001 ). A survey of 2,506 (AC)n repeat elements found 93% to be polymorphic (Weissenbach et al. 1992 ), or about 82,000 for the whole genome. Note that this estimate is based on a study of unpreselected (AC)n elements with 12 or more repeats, such that the proportion polymorphic is likely a good estimate of the proportion genome wide. The variants in our data set occur in either orientation with respect to the sense strand and are widely distributed with respect to the start of transcription (table 3 ). We infer the presence of 3,800 functional, polymorphic (AC)n microsatellites in human populations, given three conservative assumptions: first, that repeats within 2.5 kb of the start of transcription (a 5-kb window) are functional, as implied by experimental data (Hamada et al. 1984 ; Epplen, Kyas, and Maueler 1996 ; Shimajiri et al. 1999 ; table 3 ) and that others are not (conservative because some repeats further afield are known to be functional); second, that (AC)n elements are randomly distributed throughout the genome (conservative because their distribution may actually be biased toward the vicinity of transcription start sites, Schroth, Chou, and Ho 1992 ); and third, that there are 30,000 starts of transcription (conservative because many genes have multiple start sites). The average heterozygosity for the (AC)n repeats in our data set is 0.695, which is typical for polymorphic dinucleotide repeats (Bowcock et al. 1994 ; Dib et al. 1996 ). Thus, we infer that a typical human is heterozygous for 2,600 functional (AC)n variants.

A functional role for (AC)n microsatellites has important consequences. Their high heterozygosity and high (5–12 × 10−4) mutation rate (Weissenbach et al. 1992 ; Weber and Wong 1993 ) may account for the rapid generation of phenotypic variation and its maintenance through bottleneck episodes or in the face of stabilizing or directional selection (Kashi, King, and Soller 1997 ). In addition, microsatellite length may itself be under stabilizing selection for _cis_-regulatory function, explaining their unexpectedly high level of conservation among apes and low repeat-number variance among humans (Bowcock et al. 1994 ; Deka et al. 1994 ).

Only 37.9% of the multiallelic variants in our data set are (AC)n. If our sample is representative of functional multiallelic variants, we may infer the presence of an additional 6,200 functional multiallelic polymorphisms segregating in human populations. In our data set, the non-(AC)n multiallelic variants have an average heterozygosity of 0.619, implying that each human is heterozygous at 3,900 non-(AC)n multiallelic functional _cis_-regulatory sites or 6,500 multiallelic sites in total (table 2 ).

An alternative approach to this extrapolation, abandoning the assumption of a ±2.5 kb functional domain, uses the observation that in 101 genes with appropriate data, we found 29 multiallelic variants; the average heterozygosity of functional multiallelic variants, based on the 23 for which we have frequency data, is 0.649. Extrapolating to 30,000 genes, we arrive at an estimate of 8,600 functional multiallelic variants, of which 5,600 will be heterozygous on average.

Genome-Wide Functional Variation: _cis_-Regulatory Versus Coding

From the separate extrapolations for bi- and multiallelic _cis_-regulatory variants, we approach a range of 16,300–17,200 functional variants that are heterozygous on average. If we extrapolate from the combined data, we can use the observations of 1.43 functional variants per gene (115 bi- and 29 multiallelic variants in 101 genes), an average heterozygosity of 0.373, and a total of 30,000 genes, to estimate 16,000 functional _cis_-regulatory sites heterozygous on average.

In contrast, a human is likely to be heterozygous at only about 7,900–12,900 amino acid positions, given estimates of heterozygosity at nonsynonymous sites (0.196/kb [Cargill et al. 1999] and 0.32/kb [Sunyaev et al. 2000] ), an average coding sequence length of 1,340 bp per gene (International Human Genome Sequencing Consortium 2001 ), and 30,000 genes. Only if our data set is dramatically unrepresentative, such that one-fifth of all genes have no segregating _cis_-regulatory variation in any human population, would the average number of heterozygous _cis_-regulatory sites be less than 12,900. The greater number of heterozygous functional _cis_-regulatory than coding sites is consistent with the results of recent interspecific analyses, which suggest that the number of noncoding nucleotides conserved between human and mouse exceeds the number of coding nucleotides (Frazer et al. 2001 ; Shabalina et al. 2001 ). In short, the size of the functional noncoding genome may simply exceed the size of the coding genome.

Four sources of bias contribute to these estimates. First, we assume that our genes are representative, although they are included specifically because of the presence of functional polymorphisms. We doubt, however, that this constitutes a major source of bias in practice; a consideration of the size of each gene's _cis_-regulatory region and the extent of human nucleotide diversity suggests that nearly all genes will have at least one _cis_-regulatory site segregating as a polymorphism in some human population. We can circumvent this potential bias by noting that our extrapolations for multiallelic variants do not rely on the assumption that our genes are representative of all genes, only that (AC)n microsatellites are randomly distributed and functional when near transcription start sites. If we accept the extrapolation for multiallelic variants, we can back-calculate heterozygosity at biallelic sites. We found previously that 10,000 functional multiallelic sites (3,800 [AC]n and 6,200 others) are segregating variants in human populations, and we found that multiallelic variants make up 20.1% of functional variants in our data set. Thus, by assuming that the proportion of multiallelic functional variants in our data set is a good estimator of the proportion among all functional variants, which we have no reason to doubt, we arrive at an estimate of 40,000 functional biallelic variants segregating in human populations. Given the estimated heterozygosity for such variants (0.313), we estimate an average of 12,500 heterozygous biallelic functional sites, slightly higher than the 10,700 estimated by assuming that the genes in the data set are representative. Second, our extrapolations are based on the assumption that all variants within our genes are known. In fact, for most of these genes only small segments of the _cis_-regulatory sequence have been searched for polymorphisms, and few of the known polymorphisms have been experimentally tested for function. Even then, the context-dependence of _cis_-regulatory function dictates that some functional variants will be missed and we will have underestimated the number of functional variants per gene. Third, we assume that there are 30,000 genes (i.e., sets of cotranscribed coding exons) and 30,000 starts of transcription (equivalently, first exons or promoters). These specific numbers are unimportant because our estimates of heterozygosities can be rescaled to accommodate any other number. But the ratio of genes to promoters is important. Many genes have multiple first exons and hence _cis_-regulatory regions, which increase the number of possible sites for functional _cis_-regulatory variants without affecting the number of possible sites for nonsynonymous coding variants. For example, UGT1A has at least nine separate first exons, each with its own promoter and start of transcription, spread over hundreds of kilobases (Gong et al. 2001 ). By assuming a 1:1 ratio of genes to start of transcription, we underestimate the ratio of functional _cis_-regulatory sites to amino acid sites that are heterozygous on average, given in this study as greater than 16,000:12,900. A correction for this bias would involve multiplying the estimated number of heterozygous functional _cis_-regulatory sites by the average number of promoters per gene, a number that is unfortunately unknown. Finally, we consider only polymorphisms, variants with rare allele frequencies greater than 1%, whereas surveys of amino acid polymorphism consider all variants, including rare singleton variants, which usually comprise about a third of polymorphisms identified in most surveys (e.g., Stephens et al. 2001 ). This difference again contributes to an underestimate of the ratio of _cis_-regulatory to amino acid sites that are heterozygous.

In sum, a bias among genes, which can be circumvented, contributes to an overestimate of the ratio, whereas biases among variants and frequencies, as well as our assumed gene:promoter ratio, contribute to an underestimate. The magnitudes of these biases are impossible to estimate, but on the whole, the data suggest that humans are on average heterozygous for more functional _cis_-regulatory sites than for amino acid sites. When additional functional noncoding variants are considered, such as those influencing mRNA stability and transport, splicing, and translation, the average number of heterozygous functional noncoding sites surely exceeds the number of heterozygous amino acid sites.

The Absence of Transposons

Although transposition is often considered an engine of regulatory evolution (Britten 1997 ; Brosius 1999 ; Carroll, Grenier, and Weatherbee 2001 , p. 180), we found no example in humans of a segregating polymorphism of this type that has been experimentally implicated in variation in transcription. The disparity between the well-documented role for transposition in interspecific regulatory evolution (Britten 1997 ; Brosius 1999 ) and the absence of transposition mutations in our data set may be due to a dramatic slowdown in transposition rates in humans (International Human Genome Sequencing Consortium 2001 ). According to this model, transposition, although once important, no longer plays a major role in regulatory evolution in our species. A second possibility is that transposons typically have dramatic fitness consequences, such that they are either quickly fixed in populations or lost but rarely segregate as polymorphisms. Although rare transposition events with severe clinical phenotypes are documented in humans (Deininger and Batzer 1999 ), supporting the latter scenario, neither explanation is complete: segregating LINE-1 and Alu polymorphisms are well documented and likely number more than 2,000 (Sheen et al. 2000 ). Their absence from our data set may simply indicate that researchers have not yet focused on their functional consequences or that the numbers of transcriptionally relevant polymorphisms due to transposition are so small relative to other classes of polymorphisms that we would not expect to find them in our sample. Proportionally, polymorphisms due to transposition are clearly minor contributors to transcriptional variation in humans.

Distinctive Genotype-Phenotype Dynamics of _Cis_-Regulatory Variants

Our survey of known functional _cis_-regulatory variants has revealed a number of distinctive characteristics of the mode of action of these variants, with implications for their evolution and their role in phenotypic variation. Although differences among the studies in our survey prevent us from drawing strict quantitative conclusions about these characteristics, we believe that by outlining them and providing some empirical examples we can help focus future research on _cis_-regulatory variation.

First, while transcription factors typically regulate many downstream target genes, the identities of the target genes will differ among individuals on the basis of the variation in _cis_-regulatory DNA sequences. Consequently, changes in expression or structure of a transcription factor will not only be highly pleiotropic, influencing many downstream loci, but also highly epistatic, that is, dependent on genetic background. That this sort of _cis_-trans epistasis is common is suggested by the occurrence of multiple polymorphic binding sites for individual transcription factors, especially Sp1, and by the occurrence in our data set of four transcription factors whose _cis_-regulation is polymorphic (IRF1, PAX3, PAX6, and VDR). Moreover, _cis_-trans epistasis is corroborated by experimental studies of genes in our data set. For example, Rutter et al. (1998) , using 4 kb reporter constructs of the MMP1 gene, differing only by a single nucleotide indel, found repeatable differences in the effect on transcription when the constructs were transfected into fibroblast cells from four different donors. At the level of organismal phenotype as well, background effects are well documented for _cis_-regulatory variants; for example, national origin interacts with variants at the SCYA5 and CCR5 loci to influence disease progression (Gonzalez et al. 1999 , 2001 ).

Second, epistasis in cis, whereby the effect on transcription of one _cis_-regulatory variant depends on covariation at a linked _cis_-regulatory site, is common in our data set, including, for example, among SNPs at the LIPC locus (Botma, Verhoeven, and Jansen 2001 ), among SNPs and a VNTR at IL6 (Terry, Loukaci, and Green 2000 ), and among VNTRs at COL1A2 (Akai, Kimura, and Hata 1999 ). A second form of _cis_-epistasis, in which organismal phenotype depends on linkage phase between _cis_-regulatory variants and amino acid polymorphisms, is also common. For example, this haplotype-dependent dosage-by-structure interaction is documented for PON1 (James et al. 2000 ) and APOE (Lambert et al. 1998 ). The ubiquity of these two classes of epistasis in cis lends credence to the classical idea of the coadapted gene complex (Dobzhansky 1951 , p. 278) and the Lewontinian notion of the genome as the unit of selection (Lewontin 1974 , pp. 273–318). It may also point to the viability of epistatic selection as an explanation for irregular patterns of linkage disequilibrium in humans (Stephens et al. 2001 ). At minimum, the frequent occurrence of linked, interacting functional sites mandates that models for the evolution of the human gene sequences incorporate the interaction of linkage and epistasis.

Third, we note that many of the genes with regulatory polymorphisms interact with one another in regulatory, metabolic, and physiological networks. As a consequence, moderate levels of variation at any one locus will be amplified by the number of genes in the network to yield high levels of polymorphism at the network level. For example, the genes APOA1, APOA2, APOB, APOC1, APOC3, IGFBP3 IGF2, IL1, IL4, IL6, IL10, INS, FGB, LPL, LIPC, LIPE, TNF, and TGFB all have functional _cis_-regulatory polymorphisms and interact with one another directly or indirectly. Analytical models and simulations have shown that polymorphism distributed through a network of interacting genes may result in a nonlinear genotype-phenotype relationship (that is, one characterized by dominance and epistasis) as an emergent property of lower level additive gene action (Nijhout and Paulsen 1997 ; Gilchrist and Nijhout 2001 ), and such multilocus phenomena are now being reported in the literature on human _cis_-regulatory variation (Jansen et al. 2001 ).

Fourth, a large proportion of the variants in our data set exhibit differential responses to exogenous inducers. In some cases, this means that alleles differ in the magnitude of inductive response—for example, phorbol ester stimulation interacts with an SNP at the MGP locus (Farzaneh-Far et al. 2001 ) and with a VNTR at the PDYN locus (Zimprich et al. 2000 )—but in other cases each allele shows a different direction of response, as at the SLC11A1 locus, at which the combination of interferon-γ and bacterial lipopolysaccharide leads to the upregulation of one allele relative to the basal transcription rate but leads to repression of transcription from another allele (Searle and Blackwell 1999 ). Because environmental cues are often mediated through transcriptional regulation and because inducibility is a pervasive property of human genes (Iyer et al. 1999 ), _cis_-regulatory polymorphism likely constitutes a major genetic basis for genotype-by-environment interaction effects. At the organismal phenotypic level, such effects are well documented, as for example at the CETP locus, where a _cis_-regulatory SNP and alcohol consumption interact in the modulation of HDL-cholesterol levels (Corbex et al. 2000 ).

Fifth, although _cis_-regulatory sequences are often assumed to influence discrete aspects of transcription of a single gene and so to be the basis for developmental modularity (Stern 2000 ; Carroll, Grenier, and Weatherbee 2001 , pp. 91–92), our data set includes two instances of the opposite phenomenon: single variants influencing transcription from multiple genes. The imperfect VNTR upstream of the INS gene influences transcription from both INS and the downstream IGF2 locus (Kennedy, German, and Rutter 1995 ; Paquette et al. 1998 ), and SNPs upstream of APOC3 influence transcription of that gene in the liver (Li et al. 1995 ) and transcription of the APOA1 gene, which is transcribed convergently to APOC3, in the colon (Naganawa et al. 1997 ). The prevalence of this sort of multifunctional noncoding DNA variation and its attendant pleiotropic consequences have likely been underestimated, in part because most experimental studies of regulatory variants focus only on the effects on transcription of the nearest gene.

Sixth, we found many loci at which alleles produce spatially or temporally nonnested gene expression patterns such that heterozygotes have patterns of expression beyond the range of either homozygote. Although such alleles may act additively at the level of number of transcripts produced, they exhibit overdominance at the level of the number of nuclei or tissues with a given rate of transcription. This single-locus tissue-type overdominance may represent a selective mechanism for the maintenance of _cis_-regulatory variation (Guardiola et al. 1996 ). For example, one allele at the HLA-DQB1 locus shows higher expression than another in primary skin cells, whereas in peripheral blood mononuclear cells and B-lymph cells the situation is reversed (Beaty, Sukiennicki, and Nepom 1999 ). Such behavior is not limited to the MHC complex; similar dynamics are evident at such loci as AGT (Zhao et al. 1999 ), INS (Pugliese et al. 1997 ), and NOS2A (Morris et al. 2001 ).

Finally, because _cis_-regulatory sequences respond to spatially and temporally regulated transcription factors, _cis_-regulatory polymorphism represents an important genetic basis for the evolution of development by way of heterotopy and heterochrony. Our data set includes several examples of polymorphic temporal and spatial regulation. For example, a polymorphism at the FY locus affects expression in erythroid cells but not in other tissues (Tournamille et al. 1995 ). _Cis_-regulatory variants at the HBG2 locus alter the time course of expression, resulting in variable fetal hemoglobin expression among adults (Labie et al. 1985 ). Developmentally important variants are likely vastly underrepresented in our data set because of the impossibility of carrying out the necessary experiments in developing human embryos.

Conclusions

The human genome is brimming with functional variation that influences transcription and thus many aspects of phenotype. The _cis_-regulatory polymorphisms analyzed in this study, scattered across a diversity of interacting genes, point to the surprising extent of regulatory variation in humans and underscore the complexity of the genotype-phenotype relationship. _Cis_-regulatory polymorphism represents the intersection of the central themes of development and evolution: the role of differential gene expression in development (Davidson 2001 ) and the evolutionary origin of species differences as variation within populations (Haldane 1932 ; Dobzhansky 1951 ). Our species is depauperate in sequence variation compared with many others (Zwick, Cutler, and Chakravarti 2000 ); we expect that comparable or greater levels of functional variation exist within the _cis_-regulatory regions of other eukaryotes.

Supplementary Material

Tables enumerating the genes, variants, and frequencies used in our study, along with references for source studies, are available at the Molecular Biology and Evolution website http://www.smbe.org.

Stephen Palumbi, Reviewing Editor

Keywords: promoter polymorphism gene regulation evolution of development gene network transcription

Address for correspondence and reprints: Matt Rockman, Department of Biology, Duke University, Box 90338, Durham, North Carolina 27708. mrockman@duke.edu

Table 1 Mean Heterozygosities and Rare Allele Frequencies for Functional Cis -Regulatory Polymorphisms

Table 1 Mean Heterozygosities and Rare Allele Frequencies for Functional Cis -Regulatory Polymorphisms

Table 1 Mean Heterozygosities and Rare Allele Frequencies for Functional Cis -Regulatory Polymorphisms

Table 1 Mean Heterozygosities and Rare Allele Frequencies for Functional Cis -Regulatory Polymorphisms

Table 2 Distribution of Functional Variation Within the Human Genome

Table 2 Distribution of Functional Variation Within the Human Genome

Table 2 Distribution of Functional Variation Within the Human Genome

Table 2 Distribution of Functional Variation Within the Human Genome

Table 3 Distribution of Known Functional Polymorphic (AC)n Microsatellites

Table 3 Distribution of Known Functional Polymorphic (AC)n Microsatellites

Table 3 Distribution of Known Functional Polymorphic (AC)n Microsatellites

Table 3 Distribution of Known Functional Polymorphic (AC)n Microsatellites

Fig. 1.—Chromosomal locations of human genes with experimentally validated functional cis-regulatory polymorphisms. The idiogram is modified from David Adler's Idiogram Album (http://www.pathology.washington.edu/cytopages/idiograms/human/), and the genes are assigned to positions following the annotation in the human genome working draft (http://genome.ucsc.edu/). Gene symbols are according to the Human Genome Organization Gene Nomeclature Committee (http://www.gene.ucl.ac.uk/nomenclature/)

Fig. 2.—Functional polymorphisms in human cis-regulatory DNA. A, Maximum experimentally determined fold-difference in transcription rate between alleles of each gene. B, Distribution of mutation types. VNTR includes only multiallelic tandem repeat polymorphisms. “Other” includes one multiallelic hypervariable region (KLK1), one multiallelic complex of repeats, snps, and inversions (IGHA1), and one biallelic locus at which the alleles differ by multiple overlapping changes of length and sequence (SERPINC1). C, Distribution of SNP types without regard to orientation (A-G transitions on one strand are C-T transitions on the other strand; see Materials and Methods)

Fig. 2.—Functional polymorphisms in human _cis_-regulatory DNA. A, Maximum experimentally determined fold-difference in transcription rate between alleles of each gene. B, Distribution of mutation types. VNTR includes only multiallelic tandem repeat polymorphisms. “Other” includes one multiallelic hypervariable region (KLK1), one multiallelic complex of repeats, snps, and inversions (IGHA1), and one biallelic locus at which the alleles differ by multiple overlapping changes of length and sequence (SERPINC1). C, Distribution of SNP types without regard to orientation (A-G transitions on one strand are C-T transitions on the other strand; see Materials and Methods)

Fig. 3.—Rare allele frequencies for biallelic functional cis-regulatory polymorphisms in humans. A, Loci are ordered along the x-axis by the midpoints of the observed frequency range, as indicated by the diamond symbols. Each cross hatch represents a frequency estimate for a single study population, typically derived from a sample of hundreds of chromosomes. Complete sample details are available as supplementary material. B, The histogram of midpoint frequencies differs in shape from C, that expected under neutrality

Fig. 3.—Rare allele frequencies for biallelic functional _cis_-regulatory polymorphisms in humans. A, Loci are ordered along the _x_-axis by the midpoints of the observed frequency range, as indicated by the diamond symbols. Each cross hatch represents a frequency estimate for a single study population, typically derived from a sample of hundreds of chromosomes. Complete sample details are available as supplementary material. B, The histogram of midpoint frequencies differs in shape from C, that expected under neutrality

We thank the National Science Foundation and NASA for support and E. Davidson, G. Gibson, M. Hahn, M. Rausher, J. Stone, K. Zigler, R. Zufall, and members of the Wray lab for helpful suggestions.

References

Akai J., A. Kimura, R. I. Hata,

1999

Transcriptional regulation of the human type I collagen α2 (COL1A2) gene by the combination of two dinucleotide repeats

Gene

239

:

65

-73

Baier L. J., P. A. Permana, X. Yang, et al. (13 co-authors)

2000

A calpain-10 gene polymorphism is associated with reduced muscle mRNA levels and insulin resistance

J. Clin. Investig

106

:

R69

-R73

Beaty J. S., T. L. Sukiennicki, G. T. Nepom,

1999

Allelic variation in transcription modulates MHC class II expression and function

Microbes Infect

1

:

919

-927

Beutler E., T. Gelbart, A. Demina,

1998

Racial variability in the UDP-glucuronosyltransferase 1 (UGT1A1) promoter: a balanced polymorphism for regulation of bilirubin metabolism?

Proc. Natl. Acad. Sci. USA

95

:

8170

-8174

Botma G. J., A. J. Verhoeven, H. Jansen,

2001

Hepatic lipase promoter activity is reduced by the C-480T and G-216A substitutions present in the common LIPC gene variant, and is increased by Upstream Stimulatory Factor

Atherosclerosis

154

:

625

-632

Bowcock A. M., A. Ruiz-Linares, J. Tomfohrde, E. Minch, J. R. Kidd, L. L. Cavalli-Sforza,

1994

High resolution of human evolutionary trees with polymorphic microsatellites

Nature

368

:

455

-457

Brem R. B., G. Yvert, R. Clinton, L. Kruglyak,

2002

Genetic dissection of transcriptional regulation in budding yeast

Science

296

:

752

-755

Britten R. J.,

1997

Mobile elements inserted in the distant past have taken on important functions

Gene

205

:

177

-182

Britten R. J., E. H. Davidson,

1969

Gene regulation for higher cells: a theory

Science

165

:

349

-357

Brosius J.,

1999

Genomes were forged by massive bombardments with retroelements and retrosequences

Genetica

107

:

209

-238

Cambien F., O. Poirier, V. Nicaud, et al. (16 co-authors)

1999

Sequence diversity in 36 candidate genes for cardiovascular disorders

Am. J. Hum. Genet

65

:

183

-191

Carey M., S. T. Smale,

2000

Transcriptional regulation in eukaryotes Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York

Cargill M., D. Altshuler, J. Ireland, et al. (18 co-authors)

1999

Characterization of single-nucleotide polymorphisms in coding regions of human genes

Nat. Genet

22

:

231

-238

Carroll S. B., J. K. Grenier, S. D. Weatherbee,

2001

From DNA to diversity: molecular genetics and the evolution of animal design Blackwell Science, London

Cavalieri D., J. P. Townsend, D. L. Hartl,

2000

Manifold anomalies in gene expression in a vineyard isolate of Saccharomyces cerevisiae revealed by DNA microarray analysis

Proc. Natl. Acad. Sci. USA

97

:

12369

-12374

Corbex M., O. Poirier, F. Fumeron, D. Betoulle, A. Evans, J. B. Ruidavets, D. Arveiler, G. Luc, L. Tiret, F. Cambien,

2000

Extensive association analysis between the CETP gene and coronary heart disease phenotypes reveals several putative functional polymorphisms and gene-environment interaction

Genet. Epidemiol

19

:

64

-80

Crawford D. L., J. A. Segal, J. L. Barnett,

1999

Evolutionary analysis of TATA-less proximal promoter function

Mol. Biol. Evol

16

:

194

-207

Damerval C., A. Maurice, J. M. Josse, D. de Vienne,

1994

Quantitative trait loci underlying gene product variation: a novel perspective for analyzing regulation of genome expression

Genetics

137

:

289

-301

Davidson E. H.,

2001

Genomic regulatory systems: development and evolution Academic Press, San Diego

Deininger P. L., M. A. Batzer,

1999

Alu repeats and human disease

Mol. Genet. Metab

67

:

183

-193

Deka R., M. D. Shriver, L. M. Yu, L. Jin, C. E. Aston, R. Chakraborty, R. E. Ferrell,

1994

Conservation of human chromosome 13 polymorphic microsatellite (CA)n repeats in chimpanzees

Genomics

22

:

226

-230

Dib C., S. Faure, C. Fizames, et al. (14 co-authors)

1996

A comprehensive genetic map of the human genome based on 5,264 microsatellites

Nature

380

:

152

-154

Dobzhansky T.,

1951

Genetics and the origin of species Columbia University Press, New York

Enard W., P. Khaitovich, J. Klose, et al. (13 co-authors)

2002

Intra- and interspecific variation in primate gene expression patterns

Science

296

:

340

-343

Epplen J. T., A. Kyas, W. Maueler,

1996

Genomic simple repetitive DNAs are targets for differential binding of nuclear proteins

FEBS Lett

389

:

92

-95

Farzaneh-Far A., J. D. Davies, L. A. Braam, H. M. Spronk, D. Proudfoot, S. W. Chan, K. M. O'Shaughnessy, P. L. Weissberg, C. Vermeer, C. M. Shanahan,

2001

A polymorphism of the human matrix gamma-carboxyglutamic acid protein promoter alters binding of an activating protein-1 complex and is associated with altered transcription and serum levels

J. Biol. Chem

276

:

32466

-32473

Frazer K. A., J. B. Sheehan, R. P. Stokowski, X. Chen, R. Hosseini, J. F. Cheng, S. P. Fodor, D. R. Cox, N. Patil,

2001

Evolutionarily conserved sequences on human chromosome 21

Genome Res

11

:

1651

-1659

Gilchrist M. A., H. F. Nijhout,

2001

Nonlinear developmental processes as sources of dominance

Genetics

159

:

423

-432

Gong Q. H., J. W. Cho, T. Huang, et al. (11 co-authors)

2001

Thirteen UDPglucuronosyltransferase genes are encoded at the human UGT1 gene complex locus

Pharmacogenetics

11

:

357

-368

Gonzalez E., M. Bamshad, N. Sato, et al. (22 co-authors)

1999

Race-specific HIV-1 disease-modifying effects associated with CCR5 haplotypes

Proc. Natl. Acad. Sci. USA

96

:

12004

-12009

Gonzalez E., R. Dhanda, M. Bamshad, et al. (16 co-authors)

2001

Global survey of genetic variation in CCR5, RANTES, and MIP-1α: impact on the epidemiology of the HIV-1 pandemic

Proc. Natl. Acad. Sci. USA

98

:

5199

-5204

Guardiola J., A. Maffei, R. Lauster, N. A. Mitchison, R. S. Accolla, S. Sartoris,

1996

Functional significance of polymorphism among MHC class II gene promoters

Tissue Antigens

48

:

615

-625

Haldane J. B. S.,

1932

The causes of evolution Longmans, Green and Co., London

Halushka M. K., J. B. Fan, K. Bentley, L. Hsie, N. Shen, A. Weder, R. Cooper, R. Lipshutz, A. Chakravarti,

1999

Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis

Nat. Genet

22

:

239

-247

Hamada H., M. Seidman, B. H. Howard, C. M. Gorman,

1984

Enhanced gene expression by the poly(dT-dG)•poly(dC-dA) sequence

Mol. Cell. Biol

4

:

2622

-2630

Hamblin M. T., A. Di Rienzo,

2000

Detection of the signature of natural selection in humans: evidence from the Duffy blood group locus

Am. J. Hum. Genet

66

:

1669

-1679

Hartl D. L., A. G. Clark,

1997

Principles of population genetics Sinauer, Sunderland, Mass

Inoue I., T. Nakajima, C. S. Williams, et al. (12 co-authors)

1997

A nucleotide substitution in the promoter of human angiotensinogen is associated with essential hypertension and affects basal transcription in vitro

J. Clin. Investig

99

:

1786

-1797

International Human Genome Sequencing Consortium.

2001

Initial sequencing and analysis of the human genome

Nature

409

:

860

-921

Iyer V. R., M. B. Eisen, D. T. Ross, et al. (14 co-authors)

1999

The transcriptional program in the response of human fibroblasts to serum

Science

283

:

83

-87

James R. W., I. Leviev, J. Ruiz, P. Passa, P. Froguel, M. C. Garin,

2000

Promoter polymorphism T(–107)C of the paraoxonase PON1 gene is a risk factor for coronary heart disease in type 2 diabetic patients

Diabetes

49

:

1390

-1393

Jansen H., D. M. Waterworth, V. Nicaud, C. Ehnholm, P. J. Talmud,

2001

Interaction of the common apolipoprotein C-III (_APOC3_−482C > T) and hepatic lipase (_LIPC_−514C > T) promoter variants affects glucose tolerance in young adults

Ann. Hum. Genet

65

:

237

-243

Jin W., R. M. Riley, R. D. Wolfinger, K. P. White, G. Passador-Gurgel, G. Gibson,

2001

The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster

Nat. Genet

29

:

389

-395

Kashi Y., D. King, M. Soller,

1997

Simple sequence repeats as a source of quantitative genetic variation

Trends Genet

13

:

74

-78

Kennedy G. C., M. S. German, W. J. Rutter,

1995

The minisatellite in the diabetes susceptibility locus IDDM2 regulates insulin transcription

Nat. Genet

9

:

293

-298

King M. C., A. C. Wilson,

1975

Evolution at two levels in humans and chimpanzees

Science

188

:

107

-116

Labie D., J. Pagnier, C. Lapoumeroulie, F. Rouabhi, O. Dunda-Belkhodja, P. Chardin, C. Beldjord, H. Wajcman, M. E. Fabry, R. L. Nagel,

1985

Common haplotype dependency of high G gamma-globin gene expression and high Hb F levels in beta-thalassemia and sickle cell anemia patients

Proc. Natl. Acad. Sci. USA

82

:

2111

-2114

Lambert J. C., F. Pasquier, D. Cottel, B. Frigard, P. Amouyel, M. C. Chartier-Harlin,

1998

A new polymorphism in the APOE promoter associated with risk of developing Alzheimer's disease

Hum. Mol. Genet

7

:

533

-540

LeVan T. D., J. W. Bloom, T. J. Bailey, C. L. Karp, M. Halonen, F. D. Martinez, D. Vercelli,

2001

A common single nucleotide polymorphism in the CD14 promoter decreases the affinity of Sp protein binding and enhances transcriptional activity

J. Immunol

167

:

5838

-5844

Lewontin R. C.,

1972

The apportionment of human diversity

Evol. Biol

6

:

381

-398

———.

1974

The genetic basis of evolutionary change Columbia University Press, New York

Li W. W., M. M. Dammerman, J. D. Smith, S. Metzger, J. L. Breslow, T. Leff,

1995

Common genetic variation in the promoter of the human apo CIII gene abolishes regulation by insulin and may contribute to hypertriglyceridemia

J. Clin. Investig

96

:

2601

-2605

Li W.-H., L. A. Sadler,

1991

Low nucleotide diversity in man

Genetics

129

:

513

-523

McCarthy M.,

1998

Weighing in on diabetes risk

Nat. Genet

19

:

209

-210

Morris B. J., C. L. Glenn, D. E. Wilcken, X. L. Wang,

2001

Influence of an inducible nitric oxide synthase promoter variant on clinical variables in patients with coronary artery disease

Clin. Sci. (Lond.)

100

:

551

-556

Nachman M. W., S. L. Crowell,

2000

Estimate of the mutation rate per nucleotide in humans

Genetics

156

:

297

-304

Naganawa S., H. N. Ginsberg, R. M. Glickman, G. S. Ginsburg,

1997

Intestinal transcription and synthesis of apolipoprotein AI is regulated by five natural polymorphisms upstream of the apolipoprotein CIII gene

J. Clin. Investig

99

:

1958

-1965

Naylor L. H., E. M. Clark,

1990

d(TG)n•d(CA)n sequences upstream of the rat prolactin gene form Z-DNA and inhibit gene transcription

Nucleic Acids Res

18

:

1595

-1601

Neel J. V.,

1962

Diabetes mellitus: a “thrifty” genotype rendered detrimental by “progress”?

Am. J. Hum. Genet

14

:

353

-362

Nijhout H. F., S. M. Paulsen,

1997

Developmental models and polygenic characters

Am. Nat

149

:

394

-405

Paquette J., N. Giannoukakis, C. Polychronakos, P. Vafiadis, C. Deal,

1998

The INS 5′ variable number of tandem repeats is associated with IGF2 expression in humans

J. Biol. Chem

273

:

14158

-14164

Pugliese A., M. Zeller, A. Fernandez Jr.,, L. J. Zalcberg, R. J. Bartlett, C. Ricordi, M. Pietropaolo, G. S. Eisenbarth, S. T. Bennett, D. D. Patel,

1997

The insulin gene is transcribed in the human thymus and transcription levels correlated with allelic variation at the INS VNTR-IDDM2 susceptibility locus for type 1 diabetes

Nat. Genet

15

:

293

-297

Rothenburg S., F. Koch-Nolte, A. Rich, F. Haag,

2001

A polymorphic dinucleotide repeat in the rat nucleolin gene forms Z-DNA and inhibits promoter activity

Proc. Natl. Acad. Sci. USA

98

:

8985

-8990

Rutter J. L., T. I. Mitchell, G. Buttice, J. Meyers, J. F. Gusella, L. J. Ozelius, C. E. Brinckerhoff,

1998

A single nucleotide polymorphism in the matrix metalloproteinase-1 promoter creates an Ets binding site and augments transcription

Cancer Res

58

:

5321

-5325

Schliekelman P., C. Garner, M. Slatkin,

2001

Natural selection and resistance to HIV

Nature

411

:

545

-546

Schroth G. P., P. J. Chou, P. S. Ho,

1992

Mapping Z-DNA in the human genome. Computer-aided mapping reveals a nonrandom distribution of potential Z-DNA-forming sequences in human genes

J. Biol. Chem

267

:

11846

-11855

Schulte P. M., H. C. Glemet, A. A. Fiebig, D. A. Powers,

2000

Adaptive variation in lactate dehydrogenase-B gene expression: role of a stress-responsive regulatory element

Proc. Natl. Acad. Sci. USA

97

:

6597

-6602

Searle S., J. M. Blackwell,

1999

Evidence for a functional repeat polymorphism in the promoter of the human NRAMP1 gene that correlates with autoimmune versus infectious disease susceptibility

J. Med. Genet

36

:

295

-299

Shabalina S. A., A. Y. Ogurtsov, V. A. Kondrashov, A. S. Kondrashov,

2001

Selective constraint in intergenic regions of human and mouse genomes

Trends Genet

17

:

373

-376

Sheen F. M., S. T. Sherry, G. M. Risch, M. Robichaux, I. Nasidze, M. Stoneking, M. A. Batzer, G. D. Swergold,

2000

Reading between the LINEs: human genomic variation induced by LINE-1 retrotransposition

Genome Res

10

:

1496

-1508

Shimajiri S., N. Arima, A. Tanimoto, Y. Murata, T. Hamada, K. Y. Wang, Y. Sasaguri,

1999

Shortened microsatellite d(CA)21 sequence down-regulates promoter activity of matrix metalloproteinase 9 gene

FEBS Lett

455

:

70

-74

Stam L. F., C. C. Laurie,

1996

Molecular dissection of a major gene effect on a quantitative trait: the level of alcohol dehydrogenase expression in Drosophila melanogaster

Genetics

144

:

1559

-1564

Stephens J. C., J. A. Schneider, D. A. Tanguay, et al. (28 co-authors)

2001

Haplotype variation and linkage disequilibrium in 313 human genes

Science

293

:

489

-493

Stern D. L.,

2000

Evolutionary developmental biology and the problem of variation

Evolution

54

:

1079

-1091

Stone J. R., G. A. Wray,

2001

Rapid evolution of _cis_-regulatory sequences via local point mutations

Mol. Biol. Evol

18

:

1764

-1770

Sucena E., D. L. Stern,

2000

Divergence of larval morphology between Drosophila sechellia and its sibling species caused by _cis_-regulatory evolution of ovo/shaven-baby

Proc. Natl. Acad. Sci. USA

97

:

4530

-4534

Sunyaev S., J. Hanke, D. Brett, A. Aydin, I. Zastrow, W. Lathe, P. Bork, J. Reich,

2000

Individual variation in protein-coding sequences of human genome

Adv. Protein Chem

54

:

409

-437

Taillon-Miller P., Z. Gu, Q. Li, L. Hillier, P. Y. Kwok,

1998

Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms

Genome Res

8

:

748

-754

Taillon-Miller P., P. Y. Kwok,

2000

A high-density single-nucleotide polymorphism map of Xq25-q28

Genomics

65

:

195

-202

Tautz D.,

2000

Evolution of transcriptional regulation

Curr. Opin. Genet. Dev

10

:

575

-579

Terry C. F., V. Loukaci, F. R. Green,

2000

Cooperative influence of genetic polymorphisms on interleukin 6 transcriptional regulation

J. Biol. Chem

275

:

18138

-18144

Tournamille C., Y. Colin, J. P. Cartron, C. Le Van Kim,

1995

Disruption of a GATA motif in the Duffy gene promoter abolishes erythroid gene expression in Duffy-negative individuals

Nat. Genet

10

:

224

-228

Venter J. C., M. D. Adams, E. W. Myers, et al. (271 co-authors)

2001

The sequence of the human genome

Science

291

:

1304

-1351

Weber J. L., C. Wong,

1993

Mutation of human short tandem repeats

Hum. Mol. Genet

2

:

1123

-1128

Weissenbach J., G. Gyapay, C. Dib, A. Vignal, J. Morissette, P. Millasseau, G. Vaysseix, M. Lathrop,

1992

A second-generation linkage map of the human genome

Nature

359

:

794

-801

Wilson A. G., J. A. Symons, T. L. McDowell, H. O. McDevitt, G. W. Duff,

1997

Effects of a polymorphism in the human tumor necrosis factor alpha promoter on transcriptional activation

Proc. Natl. Acad. Sci. USA

94

:

3195

-3199

Wolfl S., C. Martinez, A. Rich, J. A. Majzoub,

1996

Transcription of the human corticotropin-releasing hormone gene in NPLC cells is correlated with Z-DNA formation

Proc. Natl. Acad. Sci. USA

93

:

3664

-3668

Yu N., Z. Zhao, Y. X. Fu, et al. (11 co-authors)

2001

Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1

Mol. Biol. Evol

18

:

214

-222

Zhao Y. Y., J. Zhou, C. S. Narayanan, Y. Cui, A. Kumar,

1999

Role of C/A polymorphism at −20 on the expression of human angiotensinogen gene

Hypertension

33

:

108

-115

Zimprich A., J. Kraus, M. Woltje, P. Mayer, E. Rauch, V. Hollt,

2000

An allelic variation in the human prodynorphin gene promoter alters stimulus-induced expression

J. Neurochem

74

:

472

-477

Zwick E. M., D. J. Cutler, A. Chakravarti,

2000

Patterns of genetic variation in mendelian and complex traits

Annu. Rev. Genomics Hum. Genet

1

:

387

-407

Citations

Views

Altmetric

Metrics

Total Views 3,649

2,879 Pageviews

770 PDF Downloads

Since 11/1/2016

Month: Total Views:
November 2016 1
February 2017 11
March 2017 14
April 2017 18
May 2017 10
June 2017 15
July 2017 3
August 2017 4
September 2017 7
October 2017 14
November 2017 23
December 2017 47
January 2018 82
February 2018 36
March 2018 49
April 2018 27
May 2018 28
June 2018 25
July 2018 35
August 2018 34
September 2018 32
October 2018 25
November 2018 21
December 2018 36
January 2019 18
February 2019 27
March 2019 37
April 2019 45
May 2019 40
June 2019 32
July 2019 46
August 2019 48
September 2019 12
October 2019 45
November 2019 60
December 2019 39
January 2020 41
February 2020 24
March 2020 23
April 2020 35
May 2020 18
June 2020 27
July 2020 21
August 2020 35
September 2020 23
October 2020 28
November 2020 37
December 2020 28
January 2021 12
February 2021 22
March 2021 47
April 2021 46
May 2021 20
June 2021 25
July 2021 27
August 2021 26
September 2021 191
October 2021 55
November 2021 27
December 2021 25
January 2022 30
February 2022 36
March 2022 24
April 2022 27
May 2022 17
June 2022 27
July 2022 41
August 2022 18
September 2022 51
October 2022 32
November 2022 15
December 2022 22
January 2023 19
February 2023 10
March 2023 28
April 2023 27
May 2023 14
June 2023 16
July 2023 17
August 2023 24
September 2023 764
October 2023 77
November 2023 25
December 2023 36
January 2024 42
February 2024 68
March 2024 52
April 2024 40
May 2024 30
June 2024 17
July 2024 20
August 2024 23
September 2024 19
October 2024 32

Citations

306 Web of Science

×

Email alerts

Email alerts

Citing articles via

More from Oxford Academic