How much of the human genome is conserved?
The first post in this series covered the various definitions of "function" [Quibbling about the meaning of the word "function"]. In the second post I tried to create a working definition of "function" and I discussed whether active transposons count as functional regions of the genome or junk [The Function Wars: Part II]. I claim that junk DNA is DNA that is nonfunctional and it can be deleted from the genome of an organism without affecting its survival, or the survival of its descendants.
The best way to define "function" is to rely on evolution. DNA that is under selection is functional. But how can you determine whether a given stretch of DNA is being preserved by natural selection? The easiest way is to look at sequence conservation. If the sequence has not changed at the rate expected of neutral changes fixed by random genetic drift then it is under negative selection. Unfortunately, sequence conservation only applies to regions of the genome where the sequence is important. It doesn't apply to DNA that is selected for its bulk properties.
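To make the idea of "changing more slowly than neutral" concrete, here is a toy sketch in Python. It is only an illustration of the principle, not the method used in any of the papers discussed here. The function names, the window size, the assumed neutral divergence, and the 0.5 cutoff ratio are all hypothetical choices made for the example. It takes two already-aligned sequences, measures the fraction of mismatched sites in fixed-size windows, and flags windows that have diverged much less than the assumed neutral expectation.

```python
def window_divergence(seq_a, seq_b, window):
    """Return (start, divergence) for each non-overlapping window.

    Divergence is the fraction of mismatched sites among positions
    where neither sequence has an alignment gap ('-').
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(seq_a) - window + 1, window):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # Ignore gapped columns so indels are not counted as substitutions.
        pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if not pairs:
            continue
        mismatches = sum(1 for x, y in pairs if x != y)
        results.append((start, mismatches / len(pairs)))
    return results


def conserved_windows(seq_a, seq_b, window, neutral_divergence, ratio=0.5):
    """Return start positions of windows diverging less than ratio * neutral.

    Both neutral_divergence and ratio are assumptions supplied by the caller;
    real analyses estimate the neutral rate from the data.
    """
    cutoff = ratio * neutral_divergence
    return [start
            for start, d in window_divergence(seq_a, seq_b, window)
            if d < cutoff]


# Made-up sequences: the first 10 sites are identical, the last 10 are not.
human = "ACGTACGTAC" + "ACGTACGTAC"
other = "ACGTACGTAC" + "TCGAACCTAG"
print(conserved_windows(human, other, window=10, neutral_divergence=0.4))
```

With these made-up sequences only the first window is flagged. Real analyses have to deal with alignment errors, variation in mutation rate along the genome, and statistical noise in short windows, none of which this sketch attempts.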
Let's look at how much of the human genome is conserved (sequence). Keep in mind that this value has to be less than 10% based on genetic load arguments. It should be less than 5%.
A recent paper by Rands et al. (2014) is informative. They start their paper by asking "What proportion of the human genome is functional?" and they propose a definition in the introduction ...
... evolutionary studies often equate functionality with signatures of selection. While it is undisputed that many functional regions have evolved under complex selective regimes including selective sweeps or ongoing balancing selection, and it appears likely that loci exist where recent positive selection or reduction of constraint has decoupled deep evolutionary patterns from present functional status, it is widely accepted that purifying selection persisting over long evolutionary times is a ubiquitous mode of evolution. While acknowledging the caveats, this justifies the definition of functional nucleotides used here, as those that are presently subject to purifying selection.
By emphasizing presently, they eliminate pseudogenes and defective transposons that look conserved because they have descended from a common ancestor but aren't currently subject to purifying selection. The authors look at the fraction of the human genome whose sequence is conserved. That's not the only DNA that might be under constraint but it's still an important number.
This is of course not useful as an operational definition, as selection cannot be measured instantaneously. Instead, most studies define functional sites as those subject to purifying selection between two (or more) particular species. Studies that follow this definition have estimated the proportion of functional nucleotides in the human genome, denoted as αsel, between 3% and 15%. Since each species' lineage gains and loses functional elements over time, αsel needs to be understood in the context of divergence between species.
The proportion of functional nucleotides (αsel) can be narrowed down by restricting the analysis to those sequences that can be aligned. That's the part between deletions and insertions (indels). This value is αselIndel.
UPDATE: I assumed that the authors were looking at sequence conservation but I was wrong, according to Chris Rands, the first author on the paper (see comments). He says, "In brief, we have a background model of what we expect the distribution of indels to be in neutral sequence that is not under constraint." Now I'm completely confused because what they seem to be describing is the amount of DNA that doesn't have any insertions or deletions in the genomes of other mammals. I cannot recommend this paper and I dismiss the conclusion of 8% constraint until I have a better understanding of what they mean.
This paper uses a complicated algorithm that I do not understand and I do not intend to try. I don't really know how any of the estimates of constraint/conservation are obtained but I assume it involves looking at sequence similarity within windows of a fixed size as one scans along the aligned sequences. This paper looks at conservation between humans and several other mammals including rhino, panda, and cattle. I'm under the impression that these other genomes are far from finished so I don't know how reliable that data is. I assume that the alignments are done by computer and I'm very skeptical of such alignments. I don't know how they eliminated pseudogenes and degenerate transposon sequences and I don't know how they deal with gene families.
Nevertheless, the estimate of constrained sequence is between 220 and 286 Mb. This corresponds to 7.1-9.2% of the genome. (The authors are using 3,100 Mb as the total size of the sequenced portion of the human genome.) Let's say it's 8%. About 1% of the genome is conserved coding sequence and the rest (7%) is noncoding.
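The percentages are easy to check. The short Python snippet below just redoes the arithmetic using the figures quoted above (220 Mb, 286 Mb, and a 3,100 Mb genome). The variable and function names are my own, and the Mb values derived from the rounded 8%, 1%, and 7% figures are only approximate.

```python
GENOME_MB = 3100          # sequenced portion of the human genome, per the authors
LOW_MB, HIGH_MB = 220, 286  # range of constrained sequence reported in the paper


def percent_of_genome(mb, total_mb=GENOME_MB):
    """Convert an amount of sequence in Mb to a percentage of the genome."""
    return 100 * mb / total_mb


low_pct = round(percent_of_genome(LOW_MB), 1)
high_pct = round(percent_of_genome(HIGH_MB), 1)
print(f"constrained: {low_pct}% to {high_pct}% of the genome")

# Using the rounded 8% figure: roughly how many Mb is that, and how does it
# split into coding (about 1%) and noncoding (about 7%) sequence?
constrained_mb = 0.08 * GENOME_MB
coding_mb = 0.01 * GENOME_MB
noncoding_mb = constrained_mb - coding_mb
print(f"about {constrained_mb:.0f} Mb constrained: "
      f"{coding_mb:.0f} Mb coding, {noncoding_mb:.0f} Mb noncoding")
```

The first line of output reproduces the 7.1% and 9.2% bounds, and the rounded 8% figure works out to about 248 Mb, of which about 31 Mb is coding.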
This estimate lies between some estimates that are as low as 5% and others that are about 15%. Here's what Rands et al. have to say about that ...
Our estimate that 7.1–9.2% of human genomes is subject to contemporaneous selective constraint considerably exceeds previous estimates and falls short of others. We have shown that our method's previous estimates for specific species pairs, as well as the calculation that suggested 10–15% of the human genome is currently under negative selection were inflated, in large part owing to inaccuracies in whole genome alignments upon which our estimates were based.
So, the best we can say is that about 8% of the sequences in the human genome appear to be under purifying selection. If you assume that this defines functional sequences then >90% of our genome is junk.
One thing is clear.
Our estimate that 7.1%–9.2% of the human genome is functional is around ten-fold lower than the quantity of sequence covered by the ENCODE defined elements. This indicates that a large fraction of the sequence comprised by elements identified by ENCODE as having biochemical activity can be deleted without impacting on fitness. By contrast, the fraction of the human genome that is covered by coding exons, bound motifs and DNase1 footprints, all elements that are likely to contain a high fraction of nucleotides under selection, is 9%. While not all of the elements in these categories will be functional, and functional elements will exist outside of these categories, this figure is consistent with the proportion of sequence we estimate as being currently under the influence of selection.
This is one more paper that disputes the ENCODE conclusions. At some point, the major journals like Nature and Science are going to have to admit that they were duped by the ENCODE Consortium. (Are you listening, Elizabeth Pennisi? (@epennisi))
1. Alex Palazzo suggested that we call these the "function wars." Thanks, Alex.
Rands, C.M., Meader, S., Ponting, C.P. and Lunter, G. (2014) 8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage. PLoS Genetics 10, e1004525. [doi: 10.1371/journal.pgen.1004525]