Sandwalk: The evolution of de novo genes

Monday, October 21, 2019

De novo genes are new genes that arise spontaneously from junk DNA [De novo gene birth]. The frequency of de novo gene creation is important for an understanding of evolution. If it's a frequent event, then species with a large amount of junk DNA might have a selective advantage over species with less junk DNA, especially in a changing environment.

Last week I read a short Nature article on de novo genes [Levy, 2019] and I think the subject deserves more attention. Most new genes in a species appear to arise by gene duplication and subsequent divergence. De novo genes, by contrast, are unrelated to genes in any other clade, so we can assume that they are created from junk DNA that accidentally becomes associated with a promoter, causing the DNA to be transcribed. A new gene is formed if the RNA acquires a function. If the transcript contains an open reading frame, it may be translated to produce a polypeptide, and if that polypeptide performs a new function, the resulting de novo gene is a new protein-coding gene.
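
To make the open-reading-frame step concrete, here is a minimal sketch of how one might scan a stretch of DNA for ORFs. It is only an illustration: the function name find_orfs, the min_codons parameter, and the toy sequence are all made up for this example. It looks at one strand only, and it ignores splicing and everything else that determines whether a real transcript is actually translated.

```python
# Illustrative sketch only: scan ONE strand of a DNA string for open reading frames.
# find_orfs, min_codons and the toy sequence are hypothetical names for this example.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
START_CODON = "ATG"


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs for ORFs on the given strand.

    An ORF here runs from an in-frame ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop (start codon included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three reading frames of this strand
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == START_CODON:
                    start = i  # open a candidate ORF at the first ATG
            elif codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and keep scanning this frame
    return sorted(orfs)


toy_sequence = "CCATGAAATTTTAGGC"  # made-up sequence, not real genomic DNA
print(find_orfs(toy_sequence))  # → [(2, 14)]
```

On the toy sequence the scan finds a single short ORF (ATG AAA TTT TAG). Short ORFs like this turn up by chance in random sequence, so finding one says nothing about whether the polypeptide has a function.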

The important question is whether the evolution of de novo genes is a common event or a rare event.

Noncoding genes

Noncoding genes¹ are genes that produce a functional RNA that isn't translated. The human genome contains several thousand well-established genes in this category and there is widespread speculation that we have thousands of others that have arisen recently in the human lineage. Thus, the prevailing view is that such genes arise very frequently.

These presumed de novo genes produce a variety of RNAs but many of them are referred to as lncRNAs. However, in spite of the prevailing belief, there is very little evidence that most of the postulated noncoding genes are real genes that produce functional RNAs. The fact that they produce RNAs isn't in doubt: what's in doubt is whether these RNAs are junk or not [How many lncRNAs are functional?]. Personally, I don't think there are very many de novo noncoding genes but it's still an open question.

We can't assume that the formation of de novo genes is a frequent event based on the data for noncoding genes. Conversely, we can't assume that it's a rare event until we have more data, although I think the evidence points in that direction.

Protein-coding genes

The recent review in Nature focuses exclusively on the formation of de novo protein-coding genes. The author, Adam Levy, gives some confirmed examples of such genes in fish and fruit flies and he speculates that there may be many other examples in rice, mice, and primates (including humans).

The idea that de novo genes could arise spontaneously has been around for a very long time, and suggestive examples in the scientific literature date back to the 1980s, but really good examples have only been demonstrated in the 21st century. There still aren't many confirmed cases. As in the case of noncoding genes, the number of speculative unproven examples is far higher. Nevertheless, Levy suggests that it's time to think about the implications ...

De novo genes are even prompting a rethink of some portions of evolutionary theory. Conventional wisdom was that new genes tended to arise when existing ones are accidentally duplicated, blended with others or broken up, but some researchers now think that de novo genes could be quite common: some studies suggest at least one-tenth of genes could be made in this way; others estimate that more genes could emerge de novo than from gene duplication.

It's easy to establish whether a potential new protein-coding gene produces a protein because all you have to do is identify the protein in some cell. If you can't detect the protein by looking in a wide variety of cells and tissues, then it's possible that the RNA is never translated. The absence of a protein has tentatively eliminated hundreds of potential protein-coding genes [Origin of de novo genes in humans].

Just because you can detect a protein made from a putative de novo gene doesn't mean that the protein is actually functional. It could just be a spurious polypeptide. (Most of the putative de novo genes only encode a short polypeptide of fewer than 100 amino acid residues.)

The other problem is that a truly de novo gene must not have any homologs in other species. Demonstrating this is more difficult than it seems because of the lack of highly accurate genome sequences. A rigorous and critical analysis of putative mouse de novo genes has substantially reduced the number of possibilities, so that now there appear to be only 139 candidates. This means that the rate of formation of de novo genes in the rodent lineage (mouse vs rat) is about 12 per million years. This is an upper limit; the real rate is almost certainly lower (Casola, 2018).
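
The arithmetic behind a rate like this is just the number of candidate genes divided by the time since the two lineages split. Here is a minimal sketch. The divergence time of 12 million years is an assumption chosen for illustration (the post gives only the candidate count and the resulting rate), and de_novo_rate is a hypothetical helper, not anything from Casola's paper.

```python
# Illustrative sketch: upper-bound rate of de novo gene formation.
# de_novo_rate is a hypothetical helper; the divergence time is an ASSUMPTION.


def de_novo_rate(candidates, divergence_my):
    """Return candidate de novo genes per million years of divergence."""
    if divergence_my <= 0:
        raise ValueError("divergence time must be positive")
    return candidates / divergence_my


MOUSE_CANDIDATES = 139        # candidate count quoted in the post
ASSUMED_DIVERGENCE_MY = 12.0  # assumed mouse-rat divergence time, in million years

rodent_rate = de_novo_rate(MOUSE_CANDIDATES, ASSUMED_DIVERGENCE_MY)
print(f"{rodent_rate:.1f} candidate genes per million years")  # → 11.6 candidate genes per million years
```

Every candidate that later turns out to be an old gene or a recent duplicate lowers the numerator, which is why a figure like this can only be an upper limit.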

The rate in primate lineages is less certain, but a recent paper suggests that it is about 2 per million years (Guerzoni and McLysaght, 2016), and this is consistent with other rigorous analyses of de novo protein-coding genes in humans. The rates in primates and rodents are significant, but they are at least an order of magnitude lower than the rates of new gene formation from duplication.

It is commonly believed that the formation of new species and diversification within a clade are associated with the evolution of new genes. This assumption is unnecessary, since both speciation and diversification can be achieved by simply modifying the expression of existing genes without creating new genes; the field of evo-devo is devoted to proving this fact. However, if you believe that new genes are necessary, then the only way to evolve genes with entirely new functions is by de novo formation from junk DNA.² That leads to the suggestion that the presence of large amounts of junk DNA in a species/clade gives it a selective advantage over other species with smaller genomes.

This argument is suspicious because it sounds teleological and it invokes species-level selection. It also has to deal with the fact that the rate of de novo gene formation is probably too low to account for any substantial amount of diversification over the time frames that are required for speciation.

1. It's awkward to define this group in negative terms (noncoding) but I haven't come up with a better term. Any suggestions are welcome.

2. Strictly speaking, this is not true since it's possible to create new genes from the opposite strand of existing genes. In fact, a good many of the putative de novo genes in rodents and primates fall into this category. I'm skeptical of those putative genes since there are very few proven examples of bona fide overlapping genes in eukaryotes.

Casola, C. (2018) From De Novo to “De Nono”: The Majority of Novel Protein-Coding Genes Identified with Phylostratigraphy Are Old Genes or Recent Duplicates. Genome Biology and Evolution, 10:2906-2918. [doi: 10.1093/gbe/evy231]

Guerzoni, D., and McLysaght, A. (2016) De novo genes arise at a slow but steady rate along the primate lineage and have been subject to incomplete lineage sorting. Genome Biology and Evolution, 8:1222-1232. [doi: 10.1093/gbe/evw074]

Levy, A. (2019) How evolution builds genes from scratch. Nature 574: 314-316. (Online title is: "Genes from the Junkyard.") [Nature]

37 comments:

João, Monday, October 21, 2019 7:43:00 PM
Have u ever heard of the IDiot Marcos Eberlin?
His book "Foresight: How the Chemistry of Life Reveals Planning and Purpose" was published this year

Here are three "endorsements"

“I am happy to recommend this book to those interested in the chemistry of life. Marcos Eberlin is well established in the field of chemistry and presents the current interest in biology in the context of chemistry.”—Sir John B. Gurdon, PhD, Nobel Prize in Physiology or Medicine (2012)

“An interesting study of the part played by foresight in biology.”—Brian David Josephson, Nobel Prize in Physics (1973)

“Despite the immense increase of knowledge during the past few centuries, there still exist important aspects of nature for which our scientific understanding reaches its limits. Eberlin describes in a concise manner a large number of such phenomena, ranging from life to astrophysics. Whenever in the past such a limit was reached, faith came into play. Eberlin calls this principle ‘foresight.’ Regardless of whether one shares Eberlin’s approach, it is definitely becoming clear that nature is still full of secrets which are beyond our rational understanding and force us to humility.”—Gerhard Ertl, PhD, Nobel Prize in Chemistry (2007)

I read this here: https://b-ok.cc/book/5207571/8329d8

João, Thursday, October 24, 2019 6:01:00 AM
Joe, thoughts on this?

https://onlinelibrary.wiley.com/doi/full/10.1111/evo.12517

Abstract
The existence of complex (multiple‐step) genetic adaptations that are “irreducible” (i.e., all partial combinations are less fit than the original genotype) is one of the longest standing problems in evolutionary biology. In standard genetics parlance, these adaptations require the crossing of a wide adaptive valley of deleterious intermediate stages. Here, we demonstrate, using a simple model, that evolution can cross wide valleys to produce “irreducibly complex” adaptations by making use of previously cryptic mutations. When revealed by an evolutionary capacitor, previously cryptic mutants have higher initial frequencies than do new mutations, bringing them closer to a valley‐crossing saddle in allele frequency space. Moreover, simple combinatorics implies an enormous number of candidate combinations exist within available cryptic genetic variation. We model the dynamics of crossing of a wide adaptive valley after a capacitance event using both numerical simulations and analytical approximations. Although individual valley crossing events become less likely as valleys widen, by taking the combinatorics of genotype space into account, we see that revealing cryptic variation can cause the frequent evolution of complex adaptations.

Michael Tress, Thursday, October 24, 2019 4:16:00 PM
Going back to the original article, I think the Guerzoni and McLysaght paper needs to be taken with a large pinch of salt. Firstly, although the paper was published in 2016, it is already out of date.

As far as I can tell only 2 of the 35 coding genes supposed to have arisen in the Human-Chimpanzee-Gorilla lineage are still annotated as coding in Ensembl. One of these (TMEM133) has been reclassified as an unlikely looking alternative isoform buried in the 3' UTR of ARHGAP42, while the other (KRTAP20-4) will not have been revisited by the annotators because of complications in verifying the coding status of the large keratin-associated gene family.

The paper is also not as conservative as it claims since the authors use PRIDE and the GPM to search for novel peptides. These are two very useful databases, but they cannot be used to clear up doubts about coding status; due to their sheer size they will be stuffed to the rafters with false positive peptide-spectrum matches.

I would have used PeptideAtlas to attempt to validate the de novo genes because its quality control measures remove many (not all) of these false positive matches. I suspect that none of these "de novo coding genes" would have peptide evidence in PeptideAtlas (and TMEM133 and KRTAP20-4 certainly don't).

Chris Adami, Saturday, October 26, 2019 7:22:00 PM
It seems people keep forgetting that de novo genes can be formed by a mixture of gene duplication and junk DNA. I'm thinking of the Drosophila gene Jingwei as a typical example. It's formed from pieces of the yande (ynd) gene, which itself is a duplicated copy of the Yellow emperor (ymp) gene. The two pieces of the ynd gene were shuffled together with a retroposed copy of the alcohol dehydrogenase Adh.

Eric Falkenstein, Friday, November 01, 2019 10:16:00 PM
If junk DNA are just random sequences of amino acids, isn't the probability these form a functional protein sequence incredibly small? That is, 1E-12 to 1E-77 (I'm not a biologist so I see estimates all over the map, but the largest is very small)? If the DNA in the middle of the reading frame is random, it would seem to have a small probability of creating a protein generating a selective advantage.

judmarc, Wednesday, November 06, 2019 5:51:00 PM
Saw a piece about the article and immediately hoped you'd write about it. Thanks to you and Michael Tress in the comments for quantifying things (to the extent possible).

Thumb, Tuesday, November 12, 2019 9:17:00 AM
One likely reason: junk DNA is for a large part made of sequences "devolved" from functional ones, so the mutational distance to something that could work could be much smaller than starting from purely random DNA. HTH, though I'm no biologist either.

João, Wednesday, April 22, 2020 4:15:00 PM
New paper on Drosophila de novo genes:

Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over large evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of an initial pool of novel genes.

https://link.springer.com/article/10.1007/s00239-020-09939-z?wt_mc=alerts.TOCjournals