More Recent Comments

Friday, March 15, 2024

Nils Walter disputes junk DNA: (9) Reconciliation

I'm discussing a recent paper published by Nils Walter (Walter, 2024). He is arguing against junk DNA by claiming that the human genome contains large numbers of non-coding genes.

This is the ninth and last post in the series. I'm going to discuss Walker's view on how to tone down the dispute over the amount of junk in the human genome. Here's a list of the previous posts.

"Conclusion: How to Reconcile Scientific Fields"

Walter concludes his paper with some thoughts on how to deal with the controversy going forward. I'm using the title that he choose. As you can see from the title, he views this as a squabble between two different scientific fields, which he usually identifies as geneticists and evolutionary biologists versus biochemists and molecular biologists. I don't agree with this distinction. I'm a biochemist and molecular biologist, not a geneticist or an evolutionary biologist, and still I think that many of his arguments are flawed.

Let's see what he has to say about reconciliation.

Science thrives from integrating diverse viewpoints—the more diverse the team, the better the science.[107] Previous attempts at reconciling the divergent assessments about the functional significance of the large number of ncRNAs transcribed from most of the human genome by pointing out that the scientific approaches of geneticists, evolutionary biologists and molecular biologists/biochemists provide complementary information[42] was met with further skepticism.[74] Perhaps a first step toward reconciliation, now that ncRNAs appear to increasingly leave the junkyard,[35] would be to substitute the needlessly categorical and derogative word RNA (or DNA) “junk” for the more agnostic and neutral term “ncRNA of unknown phenotypic function”, or “ncRNAupf”. After all, everyone seems to agree that the controversy mostly stems from divergent definitions of the term “function”,[42, 74] which each scientific field necessarily defines based on its own need for understanding the molecular and mechanistic details of a system (Figure 3). In addition, “of unknown phenotypic function” honors the null hypothesis that no function manifesting in a phenotype is currently known, but may still be discovered. It also allows for the possibility that, in the end, some transcribed ncRNAs may never be assigned a bona fide function.

First, let's take note of the fact that this is a discussion about whether a large percentage of transcripts are functional or not. It is not about the bigger picture of whether most of the genome is junk in spite of the fact that Nils Walter frames it in that manner. This becomes clear when you stop and consider the implications of Walter's claim. Let's assume that there really are 200,000 functional non-coding genes in the human genome. If we assume that each one is about 1000 bp long then this amounts to 6.5% of the genome—a value that can easily be accommodated within the 10% of the genome that's conserved and functional.

Now let's look at how he frames the actual disagreement. He says that the groups on both sides of the argument provide "complementary information." Really? One group says that if you can delete a given region of DNA with no effect on the survival of the individual or the species then it's junk and the other group says that it still could have a function as long as it's doing something like being transcribed or binding a transcription factor. Those don't look like "complimentary" opinions to me.

His first step toward reconciliation starts with "now that ncRNAs appear to increasingly leave the junkyard." That's not a very conciliatory way to start a conversation because it immediately brings up the question of how many ncRNAs we're talking about. Well-characterized non-coding genes include ribosomal RNA genes (~600), tRNA genes (~200), the collection of small non-coding genes (snRNA, snoRNA, microRNA, siRNA, PiWi RNA)(~200), several lncRNAs (<100), and genes for several specialized RNAs such as 7SL and the RNA component of RNAse P (~10). I think that there are no more than 1000 extra non-coding genes falling outside these well-known examples and that's a generous estimate. If he has evidence for large numbers that have left the junkyard then he should have presented it.

Walter goes on to propose that we should divide non-coding transcripts into two categories; those with well-characterized functions and "ncRNA of unknown function." That's ridiculous. That is not a "agnostic and neutral term." It implies that non-conserved transcripts that are present at less that one copy per cell could still have a function in spite of the fact that spurious transcription is well-documented. In fact, he basically admits this interpretation at the end of the paragraph where he says that using this description (ncRNA of unknown function) preserves the possibility that a function might be discovered in the future. He thinks this is the "null hypothesis."

The real null hypothesis is that a transcript has no function until it can be demonstrated. Notice that I use the word "transcript" to describe these RNAs instead of "ncRNA" or "ncRNA of unknown phenotypic function." I don't think we lose anything by using the word "transcript."

Walter also address the meaning of "function" by claiming that different scientific fields use different definitions as though that excuses the conflict. But that's not an accurate portrayal of the problem. All scientists, no matter what field they identify with, are interested in coming up with a way of identifying functional DNA. There are many biochemists and molecular biologists who accept the maintenance definition as the best available definition of function. As scientists, they are more than willing to entertain any reasonable scientific arguments in favor of a different definition but nobody, including Nils Walter, has come up with such arguments.

Now let's look at the final paragraph of Walter's essay.

Most bioscientists will also agree that we need to continue advancing from simply cataloging non-coding regions of the human genome toward characterizing ncRNA functions, both elementally and phenotypically, an endeavor of great challenge that requires everyone's input. Solving the enigma of human gene expression, so intricately linked to the regulatory roles of ncRNAs, holds the key to devising personalized medicines to treat most, if not all, human diseases, rendering the stakes high, and unresolved disputes counterproductive.[108] The fact that newly ascendant RNA therapeutics that directly interface with cellular RNAs seem to finally show us a path to success in this challenge[109] only makes the need for deciphering ncRNA function more urgent. Succeeding in this goal would finally fulfill the promise of the human genome project after it revealed so much non-protein coding sequence (Figure 1). As a side effect, it may make updating Wikipedia and encyclopedia entries less controversial.

I agree that it's time for scientists to start identifying those transcripts that have a true function. I'll go one step further; it's time to stop pretending that there might be hundreds of thousands of functional transcripts until you actually have some data to support such a claim.

I take issue with the phrase "solving the enigma of human gene expression." I think we already have a very good understanding of the fundamental mechanisms of gene expression in eukaryotes, including the transitions between open and closed chromatin domains. There may be a few odd cases that deviate from the norm (e.g. Xist) but that hardly qualifies as an "enigma." He then goes on to say that this "enigma" is "intricately linked to the regulatory roles of ncRNAs" but that's not a fact, it's what's in dispute and why we have to start identifying the true function (if any) of most transcripts. Oh, and by the way, sorting out which parts of the genome contain real non-coding genes may contribute to our understanding of genetic diseases in humans but it won't help solve the big problem of how much of our genome is junk because mutations in junk DNA can cause genetic diseases.

Sorting out which transcripts are functional and which ones are not will help fill in the 10% of the genome that's functional but it will have little effect on the bigger picture of a genome that's 90% junk.

We've known that less than 2% of the genome codes for proteins since the late 1960s—long before the draft sequence of the human genome was published in 2001—and we've known for just as long that lots of non-coding DNA has a function. It would be helpful if these facts were made more widely known instead of implying that they were only dscovered when the human genome was sequenced.

Once we sort out which transcripts are functional, we'll be in a much better position to describe the all the facts when we edit Wikipedia articles. Until that time, I (and others) will continue to resist the attempts by the students in Nils Walter's class to remove all references to junk DNA.

Walter, N.G. (2024) Are non‐protein coding RNAs junk or treasure? An attempt to explain and reconcile opposing viewpoints of whether the human genome is mostly transcribed into non‐functional or functional RNAs. BioEssays:2300201. [doi: 10.1002/bies.202300201]


Forsdyke said...


Nils Walter suggests that different scientific fields view the “junk” hypothesis differently (“two different scientific fields, which he usually identifies as geneticists and evolutionary biologists versus biochemists and molecular biologists”). Having for six decades worked in these fields (plus immunology), I again draw attention to the possibility that certain sequences do not operate in every generation but may only function in occasional generations.

If an RNA sequence happens to complement the sequence of an intracellular pathogen, then a length of dsRNA will form that may activate various alarms (e.g. inhibit protein synthesis). This was discovered in the early 1970s by various biochemists associated with the Korner laboratory (Cambridge). Even if operating rarely, if that operation is advantageous the sequence can still be conserved by natural selection. Here is one of my past Sandwalk comments that draws attention to Zabolotneva’s work on the topic:


A viral infection is a stress, and stress has long been known to provoke the transcription of repetitive elements (e.g. Alu; Liu et al, 1995), from which run-on transcription can extend into neighboring regions of “junk” DNA (Cristillo et al. 1991).

Thus, we and others (e.g. Zabolotneva et al. 2010) have suggested an RNA-based "intracellular 'immune system'," where alarms begin ringing when an RNA of viral origin forms dsRNA with an 'RNA antibody' of host origin. Like us, Zabolotneva et al. suspect that this may explain the variable quantities of 'junk DNA' found in genomes:

Quote:"Casual [random] combinations of nucleotides in ... the genome might create new DNA motifs that theoretically, after being transcribed, could be used by the host organism as a tool for recognition and targeting of intracellular pathogen transcripts. Novel transcribed [host] DNA motifs that would target the host genes [i.e. 'self'] would be eliminated from the genome [negative selection], whereas those that complementarily match the pathogen RNAs would be positively selected. Neutral motifs [yet to find a pathogen target but not interacting with 'self'] could be 'stored' in the genomes as ordinary non-coding DNA."

Others envisage a “retroposon transcriptome” (Faulkner et al. 2009).


Cristillo et al. (2001) Double-stranded RNA as a not-self alarm signal: to evade, most viruses purine-load their RNAs, but some (HTLV-1, Epstein-Barr) pyrimidine-load. J. Theor. Biol. 208, 475-491.

Faulker et al. (2009) The regulated retroposon transcriptome of mammalian cells. Nature Genetics 41, 563-571.

Liu et al. (1995) Cell stress and translational inhibitors transiently increase the abundance of mammalian SINE transcripts. Nucleic Acids Res. 23, 1758-1765.

Zabolotneva et al. (2010) How many antiviral small interfering RNAs may be encoded by the mammalian genome? Biology Direct 5, 62.

John Harshman said...

Donald Forsdyke: The fatal flaw in this explanation is that we would expect the sequences in question to be conserved, so nobody would consider them junk DNA. Since the great bulk of human sequences (~90%) isn't conserved, we would not expect to find these "RNA antibodies" or other sequences involved in the proposed pathogen response there. Nor, for similar reasons, can it explain the differences in genome size among species.

forsdyke said...

We would expect sequences that functioned against severe pathogens, and/or more frequently encountered pathogens, to be more conserved than sequences that functioned against less severe pathogens, and/or encountered pathogens less frequently. Thus, some sequences would have come under selection a mere several generations before the current generation, and some others, hundreds, or perhaps thousands of generations before the current generation. Present evidence on this, and you might well discover a fatal flaw in this particular anti-junk argument.

John Harshman said...

You describe an odd situation in which selection is powerful enough to fix an adaptive bit of sequence but not enough to preserve its sequence. I'm not sure that needs refutation.