
Monday, October 21, 2019

The evolution of de novo genes

De novo genes are new genes that arise spontaneously from junk DNA [De novo gene birth]. The frequency of de novo gene creation is important for an understanding of evolution. If it's a frequent event, then species with a large amount of junk DNA might have a selective advantage over species with less junk DNA, especially in a changing environment.

Last week I read a short Nature article on de novo genes [Levy, 2019] and I think the subject deserves more attention. Most new genes in a species appear to arise by gene duplication and subsequent divergence. De novo genes, by contrast, are unrelated to genes in any other clade, so we can assume that they are created from junk DNA that accidentally becomes associated with a promoter, causing the DNA to be transcribed. A new gene is formed if the RNA acquires a function. If the transcript contains an open reading frame then it may be translated to produce a polypeptide, and if the polypeptide performs a new function then the resulting de novo gene is a new protein-coding gene.

The important question is whether the evolution of de novo genes is a common event or a rare event.

Noncoding genes

Noncoding genes[1] are genes that produce a functional RNA that isn't translated. The human genome contains several thousand well-established genes in this category and there is widespread speculation that we have thousands of others that have arisen recently in the human lineage. Thus, the prevailing view is that such genes arise very frequently.

These presumed de novo genes produce a variety of RNAs but many of them are referred to as lncRNAs. However, in spite of the prevailing belief, there is very little evidence that most of the postulated noncoding genes are real genes that produce functional RNAs. The fact that they produce RNAs isn't in doubt: what's in doubt is whether these RNAs are junk or not [How many lncRNAs are functional?]. Personally, I don't think there are very many de novo noncoding genes but it's still an open question.

We can't assume that the formation of de novo genes is a frequent event based on the data for noncoding genes. Conversely, we can't assume that it's a rare event until we have more data, although I think the evidence points in that direction.

Protein-coding genes

The recent review in Nature focuses exclusively on the formation of de novo protein-coding genes. The author, Adam Levy, gives some confirmed examples of such genes in fish and fruit flies and he speculates that there may be many other examples in rice, mice, and primates (including humans).

The idea that de novo genes could arise spontaneously has been around for a very long time and there have been suggestive examples in the scientific literature dating back to the 1980s but it has only been in the 21st century that really good examples have been demonstrated. There still aren't many confirmed cases. As in the case of noncoding genes, the number of speculative unproven examples is far higher. Nevertheless, Levy suggests that it's time to think about the implications ...

"De novo genes are even prompting a rethink of some portions of evolutionary theory. Conventional wisdom was that new genes tended to arise when existing ones are accidentally duplicated, blended with others or broken up, but some researchers now think that de novo genes could be quite common: some studies suggest at least one-tenth of genes could be made in this way; others estimate that more genes could emerge de novo than from gene duplication."

It's easy to establish whether a potential new protein-coding gene produces a protein because all you have to do is identify the protein in some cell. If you can't detect the protein by looking in a wide variety of cells and tissues, then it's possible that the RNA is never translated. The absence of a protein has tentatively eliminated hundreds of potential protein-coding genes [Origin of de novo genes in humans].

Just because you can detect a protein made from a putative de novo gene doesn't mean that the protein is actually functional. It could just be a spurious polypeptide. (Most of the putative de novo genes only encode a short polypeptide of fewer than 100 amino acid residues.)

The other problem is that in order to be a truly de novo gene there must not be any homologs in other species. Demonstrating this is more difficult than it seems because of the lack of highly accurate genome sequences. A rigorous and critical analysis of putative mouse de novo genes has resulted in a substantial reduction of the total possibilities so that now there appear to be only 139 candidates. This means that the rate of formation of de novo genes in the rodent lineage (mouse vs rat) is about 12 per million years. This is an upper limit; the real rate is almost certainly lower (Casola, 2018).

The rate in primate lineages is unknown but a recent paper suggests that it is about 2 per million years (Guerzoni and McLysaght, 2016) and this is consistent with other rigorous analyses of de novo protein-coding genes in humans. The rates in primates and rodents are significant but they are at least an order of magnitude lower than the rates of new gene formation from duplication.
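
As a rough check on the arithmetic behind these figures, the sketch below back-calculates the time span implied by 139 candidates at about 12 per million years, and compares the rodent and primate estimates. The function and variable names are illustrative only. The elapsed time is not stated in this post; it is derived here purely from the two quoted numbers, so treat it as an assumption rather than a published divergence date.

```python
def rate_per_million_years(gene_count: float, elapsed_my: float) -> float:
    """Average number of new genes per million years along a lineage."""
    if elapsed_my <= 0:
        raise ValueError("elapsed_my must be positive")
    return gene_count / elapsed_my


# Figures quoted in the post.
MOUSE_CANDIDATES = 139      # putative mouse de novo genes (Casola, 2018)
RODENT_RATE_PER_MY = 12.0   # "about 12 per million years" (an upper limit)
PRIMATE_RATE_PER_MY = 2.0   # "about 2 per million years"

# Assumption: the elapsed time is back-calculated from the two numbers above,
# not taken from an independent divergence estimate.
implied_elapsed_my = MOUSE_CANDIDATES / RODENT_RATE_PER_MY

# How much faster the rodent upper limit is than the primate estimate.
rodent_to_primate_ratio = RODENT_RATE_PER_MY / PRIMATE_RATE_PER_MY

print(f"Implied time span: {implied_elapsed_my:.1f} million years")
print(f"Rodent/primate rate ratio: {rodent_to_primate_ratio:.0f}x")
```

Because the rodent figure is an upper limit, the true gap between the two lineages may be smaller than this ratio suggests.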

It is commonly believed that the formation of new species and diversification within a clade are associated with the evolution of new genes. This assumption is unnecessary since both speciation and diversification can be achieved simply by modifying the expression of existing genes, without the necessity of creating new genes; the field of evo-devo is devoted to proving this fact. However, if you believe that new genes are necessary then the only way to evolve genes with entirely new functions is by de novo formation from junk DNA.[2] That leads to the suggestion that the presence of large amounts of junk DNA in a species/clade gives it a selective advantage over other species with smaller genomes.

This argument is suspicious because it sounds teleological and it invokes species level selection. It also has to deal with the fact that the rate of de novo gene formation is probably too low to account for any substantial amount of diversification over the time frames that are required for speciation.


1. It's awkward to define this group in negative terms (noncoding) but I haven't come up with a better term. Any suggestions are welcome.

2. Strictly speaking, this is not true since it's possible to create new genes from the opposite strand of existing genes. In fact, a good many of the putative de novo genes in rodents and primates fall into this category. I'm skeptical of those putative genes since there are very few proven examples of bona fide overlapping genes in eukaryotes.

Casola, C. (2018) From De Novo to “De Nono”: The Majority of Novel Protein-Coding Genes Identified with Phylostratigraphy Are Old Genes or Recent Duplicates. Genome Biology and Evolution, 10:2906-2918. [doi: 10.1093/gbe/evy231]

Guerzoni, D., and McLysaght, A. (2016) De novo genes arise at a slow but steady rate along the primate lineage and have been subject to incomplete lineage sorting. Genome Biology and Evolution, 8:1222-1232. [doi: 10.1093/gbe/evw074]

Levy, A. (2019) How evolution builds genes from scratch. Nature 574: 314-316. (Online title is: "Genes from the Junkyard.") [Nature]

37 comments:

João said...

Have you ever heard of the IDiot Marcos Eberlin?
His book "Foresight: How the Chemistry of Life Reveals Planning and Purpose" was published this year.

Here are three "endorsements"

“I am happy to recommend this book to those interested in the chemistry of life. Marcos Eberlin is well established in the field of chemistry and presents the current interest in biology in the context of chemistry.”—Sir John B. Gurdon, PhD, Nobel Prize in Physiology or Medicine (2012)

“An interesting study of the part played by foresight in biology.”—Brian David Josephson, Nobel Prize in Physics (1973)

“Despite the immense increase of knowledge during the past few centuries, there still exist important aspects of nature for which our scientific understanding reaches its limits. Eberlin describes in a concise manner a large number of such phenomena, ranging from life to astrophysics. Whenever in the past such a limit was reached, faith came into play. Eberlin calls this principle ‘foresight.’ Regardless of whether one shares Eberlin’s approach, it is definitely becoming clear that nature is still full of secrets which are beyond our rational understanding and force us to humility.”—Gerhard Ertl, PhD, Nobel Prize in Chemistry (2007)

I read this here: https://b-ok.cc/book/5207571/8329d8

Joe Felsenstein said...

He seems to have become the Discovery Institute's affiliate in Brazil. Reading the book blurbs you linked to, it sounds as if his arguments are mostly "look, it's so brilliant an adaptation that it must be designed and not have evolved!" He also seems to accept everything else the DI says, including their worst arguments. The endorsements above are by people who are not evolutionary biologists, and two of them are not any kind of biologist. Even so, their endorsements seem less than total.

João said...

Yes, Joe, we can say he is the voice of ID in Brazil (I am Brazilian). And yes, I agree that the endorsements are not total. I think they used them only because they are Nobel laureates. I briefly checked their curricula, and one of them has quite strange beliefs, so it wouldn't be any surprise if he supports ID.

Thank you for replying.

João said...

Joe, thoughts on this?

https://onlinelibrary.wiley.com/doi/full/10.1111/evo.12517

Abstract
The existence of complex (multiple‐step) genetic adaptations that are “irreducible” (i.e., all partial combinations are less fit than the original genotype) is one of the longest standing problems in evolutionary biology. In standard genetics parlance, these adaptations require the crossing of a wide adaptive valley of deleterious intermediate stages. Here, we demonstrate, using a simple model, that evolution can cross wide valleys to produce “irreducibly complex” adaptations by making use of previously cryptic mutations. When revealed by an evolutionary capacitor, previously cryptic mutants have higher initial frequencies than do new mutations, bringing them closer to a valley‐crossing saddle in allele frequency space. Moreover, simple combinatorics implies an enormous number of candidate combinations exist within available cryptic genetic variation. We model the dynamics of crossing of a wide adaptive valley after a capacitance event using both numerical simulations and analytical approximations. Although individual valley crossing events become less likely as valleys widen, by taking the combinatorics of genotype space into account, we see that revealing cryptic variation can cause the frequent evolution of complex adaptations.

Joe Felsenstein said...

Do not have time to review all their population genetics theory, but I don't see anything amiss, offhand. Joanna Masel and co. are knowledgeable about this -- I have not worked on Irreducible Complexity. They seem to cite all the background papers that I know of. Was there some issue that was raised about this paper?

John Harshman said...

"genetic adaptations that are “irreducible” (i.e., all partial combinations are less fit than the original genotype)"

That's not even the definition of "irreducible" used by Behe. I can see why, as it clearly suggests the possibility of an evolutionary pathway of increasing fitness leading from a prior state to the current one.

Joe Felsenstein said...

Abraham Lincoln is said to have been asked for an endorsement of a book. His endorsement was "For people who like this sort of thing, this is the sort of thing they would like." If he had a Nobel Prize, then people might put that on the back cover of their book.

João said...

"Was there some issue that was raised about this paper?"
Well, no. I haven't seen anybody discussing it.

Harshman said above:

"genetic adaptations that are “irreducible” (i.e., all partial combinations are less fit than the original genotype)"

That's not even the definition of "irreducible" used by Behe.


Do you agree, Joe?

João said...

Sorry. What I want to ask is: isn't the above definition equivalent to the following?

"An irreducibly complex evolutionary pathway is one that contains one or more unselected steps (that is, one or more necessary-but-unselected mutations)."

João said...

I got it. Well, the Trotter et al. definition says that all partial combinations are less fit than the original genotype, while Behe says that at least one step is unselected.

Question:

If the Trotter et al. results, using their definition of IC, suggest IC to be a common product, why should one think that IC (Behe) won't even vê possible?

João said...

I'm writing like a chimp. Sorry, guys. Sorry. ={ I write in english but technology wants a write in portuguese, so you read things like "vê possible" (be possible).

Again, I'm Sorry.


Well, I see why the Trotter et al. and Behe definitions are not the same.

Still, I think that if IC (sensu Trotter et al.) frequently evolves, so does IC sensu Behe.

Michael Tress said...

Going back to the original article, I think the Guerzoni and McLysaght paper needs to be taken with a large pinch of salt. Firstly, although the paper was published in 2016, it is already out of date.

As far as I can tell only 2 of the 35 coding genes supposed to have arisen in the Human-Chimpanzee-Gorilla lineage are still annotated as coding in Ensembl. One of these (TMEM133) has been reclassified as an unlikely looking alternative isoform buried in the 3' UTR of ARHGAP42, while the other (KRTAP20-4) will not have been revisited by the annotators because of complications in verifying the coding status of the large keratin-associated gene family.

The paper is also not as conservative as it claims since the authors use PRIDE and the GPM to search for novel peptides. These are two very useful databases, but they cannot be used to clear up doubts about coding status; due to their sheer size they will be stuffed to the rafters with false positive peptide-spectrum matches.

I would have used PeptideAtlas to attempt to validate the de novo genes because its quality control measures remove many (not all) of these false positive matches. I suspect that none of these "de novo coding genes" would have peptide evidence in PeptideAtlas (and TMEM133 and KRTAP20-4 certainly don't).

Joe Felsenstein said...

Offhand (without actually going back and reading him) my recollection is Behe's definition is that a complex has IC if none of its components can be removed without making it nonfunctional. His argument was that such a complex could not have evolved by ordinary evolutionary processes. Multiple people (such as Allen Orr) immediately pointed out that interacting molecules could start out interacting weakly, and then natural selection could make them interact more and more strongly, until one had IC by Behe's definition. So there were possible paths to IC, contrary to Behe's assertion.

João said...

Yeah, I know about the Orr-like approach. But somewhere¹ Behe has redefined IC to be:

"An irreducibly complex evolutionary pathway is one that contains one or more unselected steps (that is, one or more necessary-but-unselected mutations)."

If we take IC as defined above, I see that this is not a problem at all. And Trotter et al. may provide a good point against those who say that this type of IC cannot evolve. Moreover, Behe seems to neglect neutral evolution. By the way, constructive neutral evolution can frequently be involved in the evolution of IC systems.

¹ In Defense of the Irreducibility of the Blood Clotting Cascade: Response to Russell Doolittle, Ken Miller and Keith Robison, July 31, 2000, Discovery Institute article. Link here: https://web.archive.org/web/20150906193225/http://www.discovery.org/a/442

Larry Moran said...

I agree that there are still lots of questions about the number of real de novo genes in humans. The purpose of my post was to show that this number keeps dropping as more and more potential candidates are eliminated. Today's most optimistic numbers are far too low to justify any serious speculation about junk DNA contributing to the evolution of a clade.

Chris Adami said...

It seems people keep forgetting that de novo genes can be formed by a mixture of gene duplication and junk DNA. I'm thinking of the Drosophila gene Jingwei as a typical example. It's formed from pieces of the yande (ynd) gene, which itself is a duplicated copy of the Yellow emperor (ymp) gene. The two pieces of the ynd gene were shuffled together with a retroposed copy of the alcohol dehydrogenase Adh.

Larry Moran said...

I'm not forgetting about examples like that. It's just easier to explain the problem by concentrating on genes that have arisen entirely from junk DNA. I think most of us know that biology is messy and all sorts of things can happen that don't fit easily into a particular category.

Chris Adami said...

Wasn't implying you in particular :-) But yes, that's the right take. Anything goes that's not explicitly forbidden. Biology doesn't care about the categories that we set up, but that's what we have to do in order to make progress.

Jonathan P. Dowling said...

This is what I get when I Google, "Are seahorses kosher?"

Larry Moran said...

The answer is "yes." :-)

HTHHAND

David said...

I think the trap always lies in attempting to transition any evolutionary mechanism back to humans. We're a young species with a relatively long generation time; we also lack many of the ecological constraints that can drive genomic signatures of evolution. The best way to understand the evolutionary legacy of humans is to spend more effort looking outside of humans.

Bill Cole said...

Joe F
"Further mutations might make the interaction more essential and make the two subunits more dependent on one another. This is a perfectly reasonable scenario for the evolution of irreducible complexity. Anyone who claims that the very existence of irreducibly complexity means that a structure could not have evolved is wrong."

His argument "is a powerful challenge" not "could not evolve". His argument also points to specific structures like blood clotting and the flagellar motor.

Joe Felsenstein said...

@BillCole: By "further mutations" I meant " ... further mutations, favored by natural selection." Note that people do use IC as an argument for "could not evolve". They shouldn't.

João said...

Marcos Eberlin says Behe has proved that natural selection can, at best, fix just 2 mutations. Here in Brazil it is very common to hear creationists saying that IC means that the IC system cannot be the product of evolution by natural selection.

Larry Moran said...

In Darwin's Black Box (1996) Behe says that IC systems cannot be produced by natural selection. He admits that IC systems can arise by circuitous routes but these routes are so improbable that they are effectively impossible. In the last decade or so, he has come to recognize that neutral and deleterious mutations can be fixed and that this could lead to some IC systems but he still thinks that these are very rare.

In The Edge of Evolution (2007) he argues that many systems can only be explained by postulating that several different beneficial mutations must have occurred if they evolved. He argues that none of these mutations by themselves would have had the required benefit so they must have occurred simultaneously. He points out, correctly, that two such mutations could possibly occur at the same time but any system that requires three or more is impossible.

It's trivially easy to show where Behe goes wrong ...

Revisiting Michael Behe's challenge and revealing a closed mind

Understanding Michael Behe's edge of evolution

João said...

Yeah, yeah, I've read a lot of your blog. Indeed Behe seems to have a closed mind. Larry, did you read Darwin Devolves? A friend read the book and he said it has few (if any) new things to argue about evolution. I'm curious about your thoughts on the book.

Eric Falkenstein said...

If junk DNA are just random sequences of amino acids, isn't the probability these form a functional protein sequence incredibly small? That is, 1E-12 to 1E-77 (I'm not a biologist so I see estimates all over the map, but the largest is very small)? If the DNA in the middle of the reading frame is random, it would seem to have a small probability of creating a protein generating a selective advantage.

Mikkel Rumraket Rasmussen said...

Since the evidence tells us that lots of proteins have evolved from junk DNA over the history of life, that would seem to contradict your speculation that the probability of forming a functional protein sequence is "incredibly small", if by "incredibly small" you mean "so unlikely as to never happen". Obviously, since they evolve (on geological timescales) at a steady pace, the likelihood must be good enough for that to occur.

judmarc said...

Saw a piece about the article and immediately hoped you'd write about it. Thanks to you and Michael Tress in the comments for quantifying things (to the extent possible).

Thumb said...

One likely reason: junk DNA is for a large part made of sequences "devolved" from functional ones, so the mutational distance to something that could work could be much smaller than starting from purely random DNA. HTH, though I'm no biologist either.

Michael Tress said...

A lot of "junk" DNA does come from sequences that were previously functional as transposable elements (in fact almost certainly most of it does). But not all of these fossil transposable elements produced proteins; many don't, and the conversion of transposable elements into functional coding genes is rare, at least within primates.

I haven't seen a comparison of the origins of de novo coding genes in primates to be able to comment on what proportion of them might have derived from transposable element open reading frames (ORFs). It has been claimed that many de novo genes are derived from SINE Alu elements, but Alu elements don't produce any ORFs and in any case there aren't any Alu-derived coding genes in the human genome. SINE Alu fragments do seem to be preferentially incorporated as novel exons in existing coding genes though.

João said...

New paper on Drosophila de novo genes:

Together, our results suggest that gene emergence from non-coding DNA provides an abundant source of material for the evolution of new proteins. Following gene birth, gradual evolution over large evolutionary timescales moulds sequence properties towards those of conserved genes, resulting in a continuum of properties whose starting points depend on the nucleotide sequences of an initial pool of novel genes.

https://link.springer.com/article/10.1007/s00239-020-09939-z?wt_mc=alerts.TOCjournals

Larry Moran said...

The opening sentence of the abstract is, "Orphan genes, lacking detectable homologs in outgroup species, typically represent 10-30% of eukaryotic genes."

This is a ridiculous statement. What they mean to say is that there are bits of DNA with predicted open reading frames that are not found in related species. These are not orphan genes; they are POTENTIAL orphan genes. Very few of them turn out to be functional when examined closely.

The Journal of Molecular Evolution used to be a good journal but no expert in molecular evolution should ever have approved this paper for publication.

Michael Tress said...

Just to back Larry up with data, there are currently no (zero) orphan coding genes annotated in the human gene set, and all attempts to prove the existence of orphan human coding genes have so far failed. Either the gene clearly wasn't coding, or clearly wasn't an orphan.