
Thursday, February 29, 2024

Nils Walter disputes junk DNA: (3) Defining 'gene' and 'function'

I'm discussing a recent paper published by Nils Walter (Walter, 2024). He is trying to explain the conflict between proponents of junk DNA and their opponents. His main focus is building a case for large numbers of non-coding genes.

This is the third post in the series. The first one outlines the issues that led to the current paper and the second one describes Walter's view of a paradigm shift.

-Nils Walter disputes junk DNA: (1) The surprise

-Nils Walter disputes junk DNA: (2) The paradigm shaft

Any serious debate requires some definitions and the debate over junk DNA is no exception. It's important that everyone is on the same page when using specific words and phrases. Nils Walter recognizes this so he begins his paper with a section called "Starting with the basics: Defining 'function' and 'gene'."


Walter gives us two different definitions of a gene. The first one is ....

Generally, a gene is defined as a region of DNA that contains instructions for the function, growth and reproduction of an organism that is genetically inherited by the next generation of the organism.

This is a version of the Mendelian definition of a gene. It's a fuzzy definition often used by evolutionary biologists who are interested in whole organisms and ecology. It's the kind of definition used by people like Richard Dawkins who refer to anything that's "selfish" as a gene.

We're more interested in the molecular gene. Walter's definition of the molecular gene is ....

... so a broadly inclusive definition of a gene is as a region of DNA that expresses an RNA that may or may not be translated.

This is similar to the definition of a gene that's been used by knowledgeable molecular biologists and (most) textbook writers since the mid-1960s. It's the one I've advocated in previous blog posts except that it's missing one very important restriction. My definition is ....

A gene is a DNA sequence that is transcribed to produce a functional product. [What Is a Gene?]

This is the definition I will use in these discussions. I agree with Walter that a gene is a stretch of DNA that's transcribed. That means that there are two types of gene: protein-coding genes and non-coding genes. It means that introns are part of the gene and it means that regulatory elements (usually) lie outside of the gene they regulate.

However, Walter fails to add an important characteristic to his definition of a gene; namely, that the RNA product must have a function. That's necessary because pseudogenes can be transcribed and junk DNA can be transcribed to produce junk RNA. Those DNA sequences aren't genes. I think Nils Walter implicitly recognizes this restriction but it would have been nice if he had mentioned it up front.

All of this, including the difference between the Mendelian and molecular gene, has been incorporated into the Wikipedia article on Gene. This wasn't accomplished without a great deal of effort. Here are some posts on the problems we faced: Must a Gene Have a Function? and Definition of a gene (again).

The issues around defining a gene are covered in my book in Chapter 6: How Many Genes? How Many Proteins?


The world is not inhabited exclusively by fools and when a subject arouses intense interest and debate, as this one has, something other than semantics is usually at stake.
Stephen Jay Gould (1982)

The correct definition of a gene requires that the RNA product have a function so it's important to define "function" for that reason alone. But there are lots of other functional DNA elements that aren't genes (regulatory sequences, for example) and there are parts of genes that are junk (most intron sequences). The junk DNA debate is crucially dependent on how one defines functional elements and distinguishes them from junk DNA.

The Function Wars are incredibly complex and confusing and philosophers are a long way from reaching a consensus. However, molecular biologists and those interested in molecular evolution have arrived at a view of function that's broadly acceptable. This is why I announced, somewhat naively, that The function wars are over.

My view is that this is the best definition of function ...

Functional DNA is any stretch of DNA whose deletion from the genome would reduce the fitness of the individual.

The best way to identify functional DNA is to see whether it is subject to purifying selection. I prefer the term "maintenance function" to describe this definition of function. In the absence of that direct test, you can use conservation as a proxy. We now have enough examples of human genome sequences to apply the test of purifying selection and the results indicate that less than 10% of the human genome is under purifying selection [Identifying functional DNA (and junk) by purifying selection].

The definition covers stretches of DNA where the actual sequence of nucleotides is important as well as stretches where it is just the presence of bulk DNA that's important (e.g. spacer DNA). The main difficulty with this definition is the possibility that a given stretch of DNA may not be important for the individual organism but may be important for the survival of the population (species). There's no easy way to identify such sequences.

This is covered in my book in Chapter 4: Why Don't Mutations Kill Us? in a section titled "Defining function." It's also covered in Chapter 11: Zen and the Art of Coping with a Sloppy Genome.

Nils Walter thinks that the most important issue concerning junk DNA is that there are two groups of scientists using different definitions of function. He believes that the definition that I offered above is the one preferred by geneticists and evolutionary biologists and it differs from the one preferred by molecular biologists and biochemists. He's not entirely wrong about this since the definition that I support is, indeed, one that population geneticists support, but he's wrong to assume that all biochemists and molecular biologists prefer another definition. Many of the most vocal advocates for junk DNA are biochemists and they accept the maintenance function definition.

The other view of function—the one preferred by biochemists according to Walter—was proposed by ENCODE researchers.

For those biosciences studying whole organisms or ecosystems, a biological function describes the reason why an organism harbors a particular trait or behavior. In contrast, the molecular biologist and biochemist will think of biological function as a specific biochemical activity that a molecule carries out within the cells of the organism. Such a function may be as rudimentary as the recruitment of a protein by a specific segment of a nucleic acid. Accordingly, the ENCODE team defined RNA elements as functional when they bound one of the plethora of RNA-binding proteins (RBPs) found in the mammalian cell. How that protein-binding RNA element fits into, for example, a specific regulatory pathway and how this pathway then leads to a particular trait or behavior of the organism as a whole is of secondary concern.

This requires some unpacking. The definition he opposes is actually the old-fashioned "selected effect" function requiring that a functional DNA element must be the direct product of natural selection. That definition has been abandoned by many philosophers and scientists because it's too restrictive. Walter doesn't seem to appreciate that the maintenance function only requires that a functional element must be preserved by natural selection today, regardless of how it arose in the past [The Function Wars Part VI: The problem with selected effect function].

Nils Walter is basically correct in describing the ENCODE version of function. What ENCODE said in 2012 is that a DNA element has a function as long as it is associated with any kind of biochemical activity. That includes binding a transcription factor or being transcribed. This definition has been thoroughly refuted beginning with criticisms that were posted on the very same day that the 2012 papers were released. These were followed by a plethora of papers resulting in a retraction by ENCODE who admitted that some transcription was spurious and some transcription factor binding sites occur at random sites in the genome [Required reading for the junk DNA debate].

Walter admits that he doesn't know the actual role of most transcripts and most transcription factor binding sites but that's only of secondary importance. We'll see below that the essence of his position is that these DNA elements must have a function even if we don't yet know what that function is.

I note that Walter is re-opening the debate over "causal role" function and "selected effect" function without mentioning these terms. That's surprising since he must be aware of them.

The data on sequence conservation and purifying selection is consistent with the data on mutation load suggesting that 90% of our genome is junk. The evidence for real biological function based on the ENCODE causal role definition is not as conclusive. So far, only a small fraction of transcripts and transcription factor binding sites have been shown to have a real function. This is a problem for the biochemical definition that Walter prefers so ...

Even today, a survey of human genes finds that most ncRNAs—so pervasively transcribed from the human genome—have no clear function yet. While there are ever more paths toward identifying ncRNA function in service of completing the human gene catalogue, few of these tools are high in throughput, in large part because most ncRNAs participate in specific functional pathways in unique ways. This tediousness leads to the fact that the definition of a broader biological function through hypothesis-driven mechanistic studies by necessity will almost always lag behind the discovery of an RNA sequence element via modern high-throughput sequencing approaches. The absence of evidence, therefore, can be argued not to be the same as evidence of absence of a function.

The argument here is that biochemistry is complicated so it's been hard to prove that the biochemical definition of function is valid. But maybe we'll find this missing function at some time in the future. Meanwhile the fact that biochemists and molecular biologists haven't succeeded after several decades of work should not be taken as evidence that they will never succeed.

It's true that the absence of evidence isn't evidence of absence in the strictest sense of the word but there's a bit more to the issue than that. Not only do we lack evidence that most transcripts are truly functional but we have solid evidence that non-functional, spurious transcripts exist (junk RNA). You can't ignore that part when you're trying to come up with a definition of "function."

Walter, N.G. (2024) Are non‐protein coding RNAs junk or treasure? An attempt to explain and reconcile opposing viewpoints of whether the human genome is mostly transcribed into non‐functional or functional RNAs. BioEssays:2300201. [doi: 10.1002/bies.202300201]


Mark Sturtevant said...

The argument that 'absence of evidence is not evidence for absence' does not end the debate. Since this issue about pervasive function in genomes has been going on for many years, one should be able to give the standard reply that the absence of evidence IS evidence for absence when the evidence should be there by now.

John Harshman said...

One problem with the "absence of evidence" argument is that it's an excuse for ignoring the data we do have as well as a quite simple methodology for making inferences about a large population without exhaustive examination: random sampling. All he needs to do in order to show that most lncRNAs are functional is to randomly sample a small number of them and do the work to determine whether they're functional. He should thus be able to produce an estimate of the percentage of functional lncRNAs (or whatever other sort of element) in the genome, and the bigger his sample, the smaller the error bars on that estimate. But of course the sample has to be truly random, not chosen based on prior evidence of function. Why isn't he doing that, or at least suggesting such a test?
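
The point about sample size and error bars can be illustrated with a small calculation. This is only a sketch added for illustration, not anything taken from Walter's paper or from the comment itself: the function name `wilson_interval`, the choice of a Wilson score interval, and all of the counts are assumptions, and the counts are made up to show the effect rather than to estimate anything about real lncRNAs.

```python
import math


def wilson_interval(functional, sampled, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    functional: number of sampled elements judged functional (hypothetical)
    sampled:    total number of elements drawn at random (hypothetical)
    z:          normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    if not 0 <= functional <= sampled:
        raise ValueError("functional must be between 0 and sampled")

    p_hat = functional / sampled
    denom = 1 + z ** 2 / sampled
    centre = (p_hat + z ** 2 / (2 * sampled)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / sampled + z ** 2 / (4 * sampled ** 2))
        / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Made-up counts: the same observed fraction (10%) at three sample sizes.
for functional, sampled in [(2, 20), (10, 100), (100, 1000)]:
    low, high = wilson_interval(functional, sampled)
    print(f"n={sampled:5d}  observed={functional / sampled:.2f}  "
          f"interval=({low:.3f}, {high:.3f})  width={high - low:.3f}")
```

With the observed fraction held fixed, the interval narrows as the sample grows, which is the sense in which a bigger random sample gives smaller error bars. The calculation assumes the sample really is random and that each element can be classified as functional or not, which is the hard part in practice.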

Larry Moran said...

John Harshman asks, "Why isn't he doing that ...?"

There are two reasons. The first one is that it's very hard to prove a negative. If you were to assign a randomly chosen lncRNA to a postdoc, they might have to spend years looking for a possible function in order to exhaust all possibilities. In order to get a meaningful sample you would need at least 20 postdocs and graduate students.

Given what we suspect, none of the postdocs would ever get a job and none of the graduate students would get a PhD. And such a fool's errand would cost a lot of money.

The second reason is that most RNA labs are bringing in a lot of grant money by promoting the idea that there are thousands of mysterious RNAs out there and most of them are going to be connected (eventually) to some genetic disease. All these labs need is more money to show the connection. It is not in their best interests to do experiments showing that most of those RNAs are just spurious junk.

Mikkel Rumraket Rasmussen said...

The idea that there could still be function in stuff, we just haven't found it yet, has basically made it unfalsifiable. What experiment would convince them the locus isn't functional, if absence of evidence isn't evidence of absence?
When we look for something and don't find it, that's an indication it isn't there. The harder we look, the stronger that indication becomes.

We already have good ways of looking. One of those is sequence conservation, another is genome size comparisons, a third still is the molecular nature of the locus. We know what a gene that once used to function as a protein coding gene looks like, we know what a functional retrovirus looks like, etc. They look like things that used to work but broke, they are not conserved, and they're mostly transposons and dead viruses. All of that is evidence of non-function. Not just mere absence of evidence. They're evidence of absence.

John Harshman said...

Walter disposes of both the C-value paradox and sequence conservation in that paper. But if you can figure out just what his argument is for either, you can try explaining it to me.

John Harshman said...

We know what a gene that once used to function as a protein coding gene looks like, we know what a functional retrovirus looks like, etc. They look like things that used to work but broke, they are not conserved, and they're mostly transposons and dead viruses. All of that is evidence of non-function.

It should be pointed out that some of this evidence is wrong. Some pseudogenes have a function, and if that function is recently acquired they may show no conservation. Same with broken transposons and such. But the question is not whether the evidence is perfect but whether that's the way to vet.

John Harshman said...

That's "bet".

Mikkel Rumraket Rasmussen said...