Restarting the function wars (The Function Wars Part V)

The term "function wars" refers to debates over the meaning of the word "function" in biology. It refers specifically to the discussion about junk DNA because junk DNA is defined as DNA that does not have a biological function. The wars were (re-)started when the ENCODE Consortium decided to use a stupid definition of function in order to prove that most of our genome was functional. This prompted a number of papers attempting to create a more meaningful definition.

None of them succeeded, in my opinion, because biology is messy and doesn't lend itself to precise definitions. Look how difficult it is to define a "gene," for example. Or "evolution."

Nevertheless, some progress was made. Dan Graur has recently posted a summary of the two most important definitions of function [What does “function” mean in the context of evolution & what absurd situations may arise by using the wrong definition?]. The two definitions are "selected-effect" and "causal-role" (there are synonyms).

Selected-effect Function

At the risk of oversimplifying, a function is only a true biological function if it is selected. If you're trying to decide whether a particular DNA sequence has a biological function then you look to see if it is conserved. Sequence conservation is the primary indicator of function.

Causal-role Function

The other way of looking at function is to ask whether the DNA sequence does something. If it encodes a functional protein, for example, it has an obvious causal role. However, in this case it may also be conserved so an example like that doesn't distinguish between the two definitions. In fact, as Dan Graur (and others) point out, everything that meets the selected-effect definition (i.e. conserved) should also have a causal role.

The world is not inhabited exclusively by fools and when a subject arouses intense interest and debate, as this one has, something other than semantics is usually at stake.

Stephen Jay Gould (1982)

We want to know about cases where a given DNA sequence does something (causal-role) but isn't conserved. Are those still examples of biological functions that don't meet the strict criterion of being conserved (i.e. selected by natural selection)? Conversely, are there examples of conserved sequences that don't have a real biological function?

The answer to both questions is "yes" and that's why the selected-effect definition is not the definitive answer.

Before continuing, let me make it clear that sequence conservation is far and away the best indication of function that we have. It works almost all the time. As a first approximation, it's okay to insist on conservation as a crude definition of function as long as you aren't dogmatic about it. The selected-effect definition might be as good as it gets even though it's not perfect.

A sequence does something but isn't conserved

The best examples here are the ones mistakenly used by ENCODE. They identified transcription factor binding sites throughout the genome and suggested they all have a role in regulating gene expression. The second part of that claim is unproven—they may or may not have a biological role in regulation. It's very unlikely that they all have such a role.

The first part of the claim—that there are many transcription factor binding sites—is true. The DNA sites have a causal role in binding transcription factors. However, they are not conserved. When you look at the same locus in related species you often find that the sequence is different and the transcription factor will not bind to that site. Thus, the sequence might be functional according to the causal-role definition but it is not functional according to the selected-effect definition. In this case, the selected-effect definition trumps the causal-role definition and the DNA sequence does not have a biological function.1
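To get a feel for how easily such sites can arise by chance, here is a rough back-of-the-envelope sketch in Python. It is an illustration under simplifying assumptions, not a calculation taken from ENCODE or from any of the papers cited here. The function name `expected_chance_sites`, the genome size of roughly 3.2 billion bp, and the treatment of a binding site as one exact motif in DNA with equal base frequencies are all assumptions made for the example. Real binding sites are degenerate, which would only increase the number of chance matches.

```python
def expected_chance_sites(genome_bp: int, motif_len: int, both_strands: bool = True) -> float:
    """Expected number of exact matches to one fixed motif in random DNA.

    Assumption (for illustration only): positions are independent and the
    four bases are equally frequent, so each start position matches with
    probability 0.25 ** motif_len. Palindromic motifs would be counted
    twice when both strands are included.
    """
    positions = genome_bp - motif_len + 1   # possible start positions
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** motif_len


if __name__ == "__main__":
    genome = 3_200_000_000  # approximate haploid human genome size (assumption)
    for k in (6, 8, 10, 12):
        n = expected_chance_sites(genome, k)
        print(f"{k:>2} bp motif: about {n:,.0f} matches expected by chance")
```

Under these assumptions, short motifs are expected to turn up many times in a genome of this size with no selection at all, which fits the explanation that such sites are created and destroyed by random mutation.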

Let's look at a more complicated example. Intron sequences are mostly junk but there's a minimal size of intron that's necessary for proper splicing. In most eukaryotes, it seems to be about 50 bp,4 or enough to form the loop of RNA that's required to bring the 5′ and 3′ splice sites together in the spliceosome. Most of that sequence isn't conserved but it definitely has a role to play in proper splicing.

Similarly, the spacing of transcription factor binding sites is also important in formation of loops of DNA that bring together the bound transcription factor and the RNA polymerase complex poised at the promoter. The spacer sequences are necessary but they aren't conserved.

Now, you could argue that the presence of a minimal spacer sequence IS conserved even though the actual DNA sequence is not. That's true but it seems to be stretching the selected-effect definition.

More importantly, there are many "bulk DNA" hypotheses that attempt to explain the presence of large amounts of superfluous DNA. For example, several workers have postulated that bulking up the genome leads to larger nuclei and larger cells. The bulk DNA has a function of sorts but the actual sequence is irrelevant.

These hypotheses may or may not be correct—I think they're wrong—but that's not the point. The point is that it's wrong to eliminate them by fiat on the grounds they don't meet the selected-effect definition of function. The arguments for and against bulk DNA hypotheses will have to be considered on their own merit regardless of any restriction imposed by blind allegiance to a specific definition of function.

There's a much more difficult example under this category. Active transposons can jump around in the genome. There are several examples of recent insertions in the human genome. Our closest relatives (chimps and bonobos) don't have a transposon at the orthologous locus.

The active transposon often carries genes for reverse transcriptase and some form of recombination enzyme for insertion and excision. The genes are transcribed and functional proteins are produced. This meets all the criteria for a causal-role definition of function but do these transposons really have a biological function or are they junk?

I maintain they are not junk, they are part of the functional DNA fraction of the genome. Other workers aren't so sure. They have developed a two-fold definition of biological function that distinguishes between function at the organism level and function at some other level. In this case, active transposon sequences don't meet the selected-effect definition at the level of the organism so they don't count as functional at that level.2 They may count as functional at some other level, such as the intragenomic level or the level of selfish DNA, but that's not the same as the organismal level (Elliott et al., 2014; Doolittle et al., 2014).

Doolittle et al. published a figure that illustrates their description of function. Look at the quadrant on the upper right—the one labelled "function." That's the part of the genome that has a causal role and is conserved by selection at the organismal level.

Now look at the segment around 9 o'clock on their figure. That's DNA sequence that has a causal role and is also conserved. In this case the conservation is at a different level so it doesn't count as "function" by this definition. This is a case where a sequence seems to meet the selected-effect definition of function but it's being ruled out-of-bounds for other reasons. The best examples are active transposons and prophages such as integrated copies of bacteriophage lambda in E. coli.

This is a case where I prefer to stick to an unqualified selected-effect definition and call those sequences functional and not junk. (It's not clear whether the authors of those papers put them in the junk DNA category. I've asked them but they give equivocal answers.)

A sequence is conserved but not functional

Are there sequences that meet the selected-effect definition but don't have a biological function? Yes there are, but the border is fuzzy.

The problem here is operational rather than philosophical. Pseudogenes show clear evidence of sequence similarity when you compare different species but they are junk. You may argue this doesn't count because careful examination shows these sequences are drifting away from a common ancestor that once was an active gene. Eventually, their sequences will be indistinguishable from random sequence. True enough, but for now they are examples of sequences that meet the conservation criterion but they are junk.

There are other fuzzy examples. Many comparisons between genomes use small windows for their analysis. For example, they may look at 100 bp stretches along each of the genomes under comparison. Depending on their arbitrary cutoff, like 30% sequence similarity, they will detect many "conserved" sequences that are just due to chance. You need to be careful in assigning such sequences to the functional part of the genome. This isn't a problem with the definition, it's a problem with recognizing conservation.
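The chance-similarity problem is easy to simulate. The Python sketch below is only an illustration under stated assumptions: it compares pairs of completely random, unrelated 100 bp windows with equal base frequencies and no gaps, and the names `window_identity` and `chance_hit_rate` are made up for the example. Real genome comparisons use alignments and more careful statistics. Two random sequences are expected to agree at about 25% of positions, so a 30% cutoff sits only about one standard deviation above pure chance for a window of this size.

```python
import random


def window_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def chance_hit_rate(window: int = 100, cutoff: float = 0.30,
                    trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of unrelated random window pairs that reach the identity cutoff.

    Assumption (for illustration only): bases are drawn independently and
    uniformly from A, C, G, T, and windows are compared without gaps.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = "".join(rng.choice("ACGT") for _ in range(window))
        b = "".join(rng.choice("ACGT") for _ in range(window))
        if window_identity(a, b) >= cutoff:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rate = chance_hit_rate()
    print(f"Random 100 bp window pairs at or above 30% identity: {rate:.1%}")
```

Raising the cutoff or lengthening the window in this sketch makes chance hits much rarer, which is the sense in which the problem lies in recognizing conservation rather than in defining it.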

De novo genes

De novo genes are new genes that have arisen in a particular lineage. There are problems identifying such genes because you have to determine whether they have a biological function before they count as genes. You can't use conservation as a criterion because, by definition, new genes aren't conserved. In this case you need to figure out whether the gene actually does something in spite of the fact it isn't conserved.

There's a nice review of the problems in the September issue of Nature Reviews Genetics (McLysaght and Hurst, 2016). The number one problem is how to determine if a putative new gene is actually functional. This is a real problem since gene detection programs over-predict genes. Every new genomic sequence has dozens of potential new genes that have never been seen in any other species. These sequences are often referred to as ORFan genes because they have an open reading frame. The designation is unfortunate since they are actually potential or putative genes, not confirmed genes.

Most of these putative ORFan genes turn out to be false positives produced by the computer programs. That's why the number of putative ORFan genes drops precipitously as the draft genome sequence gets annotated.3 There are only a handful left in the human genome. Some of them are real and some of them are still ambiguous.

Since we're dealing with putative protein-coding genes, the best way to determine if they are real is to see if you can detect an mRNA and a protein. That eliminates most of the candidates. The next step is to find out if the (usually small) protein has a biological function or is just junk protein. That's much harder. These are all causal-role issues.

McLysaght and Hurst point out that it's still possible to look at conservation as a criterion by comparing sequences in a large number of individuals. If the sequence is under selection you expect to see less variation than you see in neighboring junk DNA. Unfortunately, these putative genes are all quite small so there's hardly any variation within the population.
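A rough calculation shows why short sequences give so little to work with. The Python sketch below uses Watterson's standard neutral expectation for the number of variable (segregating) sites in a sample, E[S] = θ × L × (1 + 1/2 + ... + 1/(n-1)), where θ is the per-site diversity, L is the sequence length and n is the number of chromosomes sampled. It is not taken from McLysaght and Hurst. The function name `expected_segregating_sites`, the per-site θ of 0.001 (a commonly quoted ballpark for humans) and the sample sizes are assumptions made for the example.

```python
def expected_segregating_sites(length_bp: int, n_chromosomes: int,
                               theta_per_site: float = 0.001) -> float:
    """Expected number of variable sites in a neutral sequence (Watterson).

    Assumptions (for illustration only): a standard neutral model with
    constant population size, and theta_per_site = 0.001 as a ballpark
    human value rather than a measured figure for any particular locus.
    """
    # a_n = 1 + 1/2 + ... + 1/(n-1), the harmonic term in Watterson's formula
    a_n = sum(1.0 / i for i in range(1, n_chromosomes))
    return theta_per_site * length_bp * a_n


if __name__ == "__main__":
    for length in (150, 300, 1000, 10_000):
        s = expected_segregating_sites(length, n_chromosomes=100)
        print(f"{length:>6} bp, 100 chromosomes: about {s:.2f} variable sites expected")
```

With these numbers, a 150 bp stretch sampled in 100 chromosomes is expected to show fewer than one variable site even if it is completely neutral, so a reduction in variation due to selection would be very hard to detect.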

There are no easy answers but trying to decide whether a putative de novo gene is functional highlights some of the problems with defining function.

UPDATE (Dec. 12, 2016)
I haven't made my personal position clear in this post. My views are the same as those I outlined in previous posts (see below). I agree with Sean Eddy (Eddy, 2013) when he says,

Attention focused on the squabbling more than the substance, and probably led some to wonder whether the arguments were just quibbling over the semantics of the word ‘function’.

Trying to conceptualize the forces that act on genome evolution is not just a matter of semantics.

Here's what I said earlier ... my view hasn't changed.

Although I am going to quibble about the word “function” in this lengthy post, my main point is that the function wars are, for the most part, distracting and unproductive. We’re interested in the big picture—whether most of our genome is junk—and that’s not going to be resolved by settling on a definition of “function.” We have enough experience in biology to know that very few terms can be defined unambiguously (e.g. “gene,” “species”).

I think of my view as being pragmatic (and scientific) as opposed to philosophical (and metaphysical). Look at the quibbling in the comments. Who the hell cares whether Ford Doolittle and Dan Graur have defined function in the same way as philosophers did several decades ago? The important point is whether pseudogenes are junk; whether most intron sequences are junk; and whether fragments of transposons are junk.

I also said ...

My position is that there's no simple definition of function but sequence conservation is a good proxy. It's theoretically possible to have selection for functional bulk DNA that doesn't depend on sequence but, so far, there are no believable hypotheses that make the case. It is wrong to arbitrarily DEFINE function in terms of selection (for sequence) because that rules out all bulk DNA hypotheses by fiat and that's not a good way to do science.

But we can't work in a complete vacuum when it comes to function. There has to be some concept of what's functional and what's not. That's why I suggest the following as a "working definition."

So, we can adopt a working definition of function and junk based on whether or not deleting the DNA in question affects the survivability of the organism or its descendants. (Keeping in mind that there are minor exceptions).

Keep in mind also, that we aren't really going to delete every bit of DNA to test whether it is junk DNA or not. A lot of the debate will be in the form of thought experiments where a likely conclusion will be suggested by what we already know about a specific sequence.

I do not intend this operational, working definition to be definitive or rigorous. Back off, philosophers. It's just a ball-park estimate of what it means to talk about junk DNA.

Function Wars
(My personal view of the meaning of function is described at the end of Part V.)

1. In this particular example, the conclusion is bolstered by the fact we have an explanation for the phenomenon; namely, random mutations in junk DNA that accidentally create a binding site. Such sites will soon be uncreated by additional mutations as the genome evolves.

2. Nobody disputes the evidence that most transposon-related sequences are defective or fragments of once-active transposons. Collectively they make up almost 50% of the human genome. They are clearly junk.

3. Unfortunately, most genomes never get past the draft stage so there are hundreds of genomes with large numbers of ORFan "genes" that will never be corrected. Intelligent Design Creationists love to focus on those genomes.

4. The underlined words ("in most eukaryotes") are an update. I added them after Georgi Marinov pointed out in the comments that there are some unusual species that have introns as small as 18-21 bp. Those species seem to have a different spliceosomal mechanism than the one found in most other eukaryotes. The important point is not the exact size of the minimal intron but the necessity of having a minimal size consisting of some sequence that's not conserved.

Doolittle, W.F., Brunet, T.D., Linquist, S., and Gregory, T.R. (2014) Distinguishing between “function” and “effect” in genome biology. Genome Biology and Evolution 6:1234-1237. [doi: 10.1093/gbe/evu098]

Elliott, T.A., Linquist, S., and Gregory, T.R. (2014) Conceptual and empirical challenges of ascribing functions to transposable elements. The American Naturalist 184:14-24. [doi: 10.1086/676588]

McLysaght, A., and Hurst, L.D. (2016) Open questions in the study of de novo genes: what, how and why. Nature Reviews Genetics 17:567-578. [doi: 10.1038/nrg.2016.78]