Showing posts sorted by date for query junk dna. Sort by relevance Show all posts

Saturday, November 19, 2022

How many enhancers in the human genome?

In spite of what you might have read, the human genome does not contain one million functional enhancers.

The Sept. 15, 2022 issue of Nature contains a news article on "Gene regulation" [Two-layer design protects genes from mutations in their enhancers]. It begins with the following sentence.

The human genome contains only about 20,000 protein-coding genes, yet gene expression is controlled by around one million regulatory DNA elements called enhancers.

Sandwalk readers won't need to be told the reference for such an outlandish claim because you all know that it's the ENCODE Consortium summary paper from 2012, the one that kicked off their publicity campaign to convince everyone of the death of junk DNA (ENCODE, 2012). ENCODE identified several hundred thousand transcription factor (TF) binding sites and in 2012 they estimated that the total number of base pairs involved in regulating gene expression could account for 20% of the genome.

How many of those transcription factor binding sites are functional and how many are due to spurious binding to sites that have nothing to do with gene regulation? We don't know the answer to that question but we do know that there will be a huge number of spurious binding sites in a genome of more than three billion base pairs [Are most transcription factor binding sites functional?].
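To get a feel for why spurious sites are expected, here is a rough back-of-the-envelope sketch in Python. It is not taken from any of the papers discussed here. It assumes a random genome with equal base frequencies and counts exact matches to a single fixed motif on both strands; the function name and the motif lengths (6, 8, and 10 bp) are hypothetical choices for illustration only.

```python
# Back-of-the-envelope sketch: expected chance matches to a short motif.
# Assumptions (not from the post): bases are independent and equally
# frequent, and only exact matches to one fixed motif are counted.

GENOME_BP = 3_000_000_000  # "more than three billion base pairs"


def expected_chance_matches(genome_bp: int, motif_len: int,
                            both_strands: bool = True) -> float:
    """Expected number of exact matches to one motif in random sequence."""
    positions = genome_bp - motif_len + 1      # possible start positions
    per_strand = positions / 4 ** motif_len    # each start matches with p = (1/4)^k
    # Counting both strands doubles the expectation (a palindromic motif
    # would be counted twice, so treat this as an upper-end estimate).
    return 2 * per_strand if both_strands else per_strand


if __name__ == "__main__":
    for k in (6, 8, 10):  # hypothetical motif lengths
        n = expected_chance_matches(GENOME_BP, k)
        print(f"{k} bp motif: about {n:,.0f} chance matches")
```

Under these simplified assumptions, a short motif turns up by chance thousands to more than a million times, depending on its length. Real transcription factors usually tolerate some mismatches, so this simple model is only meant to show the scale of the problem, not to estimate the true number of spurious sites.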

The scientists in the ENCODE Consortium didn't know the answer either but what's surprising is that they didn't even know there was a question. It never occurred to them that some of those transcription factor binding sites have nothing to do with regulation.

Fast forward ten years to 2022. Dozens of papers have been published criticizing the ENCODE Consortium for their lack of knowledge of the basic biochemical properties of DNA binding proteins. Surely nobody who is interested in this topic believes that there are one million functional regulatory elements (enhancers) in the human genome?

Wrong! The authors of this Nature article, Ran Elkon at Tel Aviv University (Israel) and Reuven Agami at the Netherlands Cancer Institute (Amsterdam, Netherlands), didn't get the message. They think it's quite plausible that the expression of every human protein-coding gene is controlled by an average of 50 regulatory sites even though there's not a single known example of any such gene.

Not only that, for some reason they think it's only important to mention protein-coding genes in spite of the fact that the reference they give for 20,000 protein-coding genes (Nurk et al., 2022) also claims there are an additional 40,000 noncoding genes. This is an incorrect claim since Nurk et al. have no proof that all those transcribed regions are actually genes but let's play along and assume that there really are 60,000 genes in the human genome. That reduces the average to "only" 17 enhancers per gene. I don't know of a single gene that has 17 or more proven enhancers, do you?
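The averages quoted above are simple division, and a few lines of Python make the arithmetic explicit. This is only an illustrative sketch: the constants are the figures quoted in this post, and the function name is made up for the example.

```python
# Average enhancers per gene implied by the claimed totals.
# All three constants are the figures quoted in the post.

CLAIMED_ENHANCERS = 1_000_000   # "around one million" enhancers
PROTEIN_CODING_GENES = 20_000   # protein-coding genes
ALL_GENES = 60_000              # adds the 40,000 claimed noncoding genes


def enhancers_per_gene(enhancers: int, genes: int) -> float:
    """Mean number of enhancers per gene if every enhancer serves one gene."""
    return enhancers / genes


if __name__ == "__main__":
    for label, genes in (("protein-coding genes only", PROTEIN_CODING_GENES),
                         ("all claimed genes", ALL_GENES)):
        avg = enhancers_per_gene(CLAIMED_ENHANCERS, genes)
        print(f"{label}: {avg:.1f} enhancers per gene")
```

The first case gives 50 and the second gives a little under 17, which is where the two averages in the text come from.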

Why would two researchers who study gene regulation say that the human genome contains one million enhancers when there's no evidence to support such a claim and it doesn't make any sense? Why would Nature publish this paper when surely the editors must be aware of all the criticism that arose out of the 2012 ENCODE publicity fiasco?

I can think of only two answers to the first question. Either Elkon and Agami don't know of any papers challenging the view that most TF binding sites are functional (see below) or they do know of those papers but choose to ignore them. Neither answer is acceptable.

I think that the most important question in human gene regulation is how much of the genome is devoted to regulation. How many potential regulatory sites (enhancers) are functional and how many are spurious non-functional sites? Any paper on regulation that does not mention this problem should not be published. All results have to be interpreted in light of conflicting claims about function.

Here are some examples of papers that raise the issue. The point is not to prove that these authors are correct - although they are correct - but to show that there's a controversy. You can't just state that there are one million regulatory sites as if it were a fact when you know that the results are being challenged.

"The observations in the ENCODE articles can be explained by the fact that biological systems are noisy: transcription factors can interact at many nonfunctional sites, and transcription initiation takes place at different positions corresponding to sequences similar to promoter sequences, simply because biological systems are not tightly controlled." (Morange, 2014)

"... ENCODE had not shown what fraction of these activities play any substantive role in gene regulation, nor was the project designed to show that. There are other well-studied explanations for reproducible biochemical activities besides crucial human gene regulation, including residual activities (pseudogenes), functions in the molecular features that infest eukaryotic genomes (transposons, viruses, and other mobile elements), and noise." (Eddy, 2013)

"Given that experiments performed in a diverse number of eukaryotic systems have found only a small correlation between TF-binding events and mRNA expression, it appears that in most cases only a fraction of TF-binding sites significantly impacts local gene expression." (Palazzo and Gregory, 2014)

One surprising finding from the early genome-wide ChIP studies was that TF binding is widespread, with thousands to tens of thousands of binding events for many TFs. These numbers do not fit with existing ideas of the regulatory network structure, in which TFs were generally expected to regulate a few hundred genes, at most. Binding is not necessarily equivalent to regulation, and it is likely that only a small fraction of all binding events will have an important impact on gene expression. (Slattery et al., 2014)

Detailed maps of transcription factor (TF)-bound genomic regions are being produced by consortium-driven efforts such as ENCODE, yet the sequence features that distinguish functional cis-regulatory sites from the millions of spurious motif occurrences in large eukaryotic genomes are poorly understood. (White et al., 2013)

One outstanding issue is the fraction of factor binding in the genome that is "functional", which we define here to mean that disturbing the protein-DNA interaction leads to a measurable downstream effect on gene regulation. (Cusanovich et al., 2014)

... we expect, for example, accidental transcription factor-DNA binding to go on at some rate, so assuming that transcription equals function is not good enough. The null hypothesis after all is that most transcription is spurious and alternative transcripts are a consequence of error-prone splicing. (Hurst, 2013)

... as a chemist, let me say that I don't find the binding of DNA-binding proteins to random, non-functional stretches of DNA surprising at all. That hardly makes these stretches physiologically important. If evolution is messy, chemistry is equally messy. Molecules stick to many other molecules, and not every one of these interactions has to lead to a physiological event. DNA-binding proteins that are designed to bind to specific DNA sequences would be expected to have some affinity for non-specific sequences just by chance; a negatively charged group could interact with a positively charged one, an aromatic ring could insert between DNA base pairs and a greasy side chain might nestle into a pocket by displacing water molecules. It was a pity the authors of ENCODE decided to define biological functionality partly in terms of chemical interactions which may or may not be biologically relevant. (Jogalekar, 2012)


Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A. V., Mikheenko, A., et al. (2022) The complete sequence of a human genome. Science, 376:44-53. [doi:10.1126/science.abj6987]

The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57-74. [doi: 10.1038/nature11247]

Saturday, November 05, 2022

Nature journalist is confused about noncoding RNAs and junk

Nature Methods is one of the journals in Nature Portfolio published by Springer Nature. Its focus is novel methods in the life sciences.

The latest issue (October, 2022) highlights the issues with identifying functional noncoding RNAs and the editorial, Decoding noncoding RNAs, is quite good—much better than the comments in other journals. Here's the final paragraph.

Despite the increasing prominence of ncRNA, we remind readers that the presence of a ncRNA molecule does not always imply functionality. It is also possible that these transcripts are non-functional or products from, for example, splicing errors. We hope this Focus issue will provide researchers with practical advice for deciphering ncRNA’s roles in biological processes.

However, this praise is mitigated by the appearance of another article in the same journal. Science journalist Vivien Marx has written a commentary with a title that was bound to catch my eye: How noncoding RNAs began to leave the junkyard. Here's the opening paragraph.

Junk. In the view of some, that’s what noncoding RNAs (ncRNAs) are — genes that are transcribed but not translated into proteins. With one of his ncRNA papers, University of Queensland researcher Tim Mercer recalls that two reviewers said, “this is good” and the third said, “this is all junk; noncoding RNAs aren’t functional.” Debates over ncRNAs, in Mercer’s view, have generally moved from ‘it’s all junk’ to ‘which ones are functional?’ and ‘what are they doing?’

This is the classic setup for a paradigm shaft. What you do is create a false history of a field and then reveal how your ground-breaking work has shattered the long-standing paradigm. In this case, the false history is that the standard view among scientists was that ALL noncoding RNAs were junk. That's nonsense. It means that these old scientists must have dismissed ribosomal RNA and tRNA back in the 1960s. But even if you grant that those were exceptions, it means that they knew nothing about Sidney Altman's work on RNAse P (Nobel Prize, 1989), or 7SL RNA (Alu elements), or the RNA components of spliceosomes (snRNAs), or PiWiRNAs, or snoRNAs, or microRNAs, or a host of regulatory RNAs that have been known for decades.

Knowledgeable scientists knew full well that there are many functional noncoding RNAs and that includes some that are called lncRNAs. As the editorial says, these knowledgeable scientists are warning about attributing function to all transcripts without evidence. In other words, many of the transcripts found in human cells could be junk RNA in spite of the fact that there are also many functional noncoding RNAs.

So, Tim Mercer is correct: the debate is over which ncRNAs are functional, and that's the same debate that's been going on for 50 years. Move along folks, nothing to see here.

The author isn't going to let this go. She decides to interview John Mattick, of all people, to get a "proper" perspective on the field. (Tim Mercer is a former student of Mattick's.) Unfortunately, that perspective contains no information on how many functional ncRNAs are present and on what percentage of the genome their genes occupy. It's gonna take several hundred thousand lncRNA genes to make a significant impact on the amount of junk DNA but nobody wants to say that. With John Mattick you get a twofer: a false history (paradigm strawman) plus no evidence that your discoveries are truly revolutionary.

Nature Methods should be ashamed, not for presenting the views of John Mattick—that's perfectly legitimate—but for not putting them in context and presenting the other side of the controversy. Surely at this point in time (2022) we should all know that Mattick's views are on the fringe and most transcripts really are junk RNA?


Monday, October 17, 2022

University press releases are a major source of science misinformation

Here's an example of a press release that distorts science by promoting incorrect information that is not found in the actual publication.

The problems with press releases are well-known but nobody is doing anything about it. I really like the discussion in Stuart Ritchie's recent (2020) book where he begins with the famous "arsenic affair" in 2010. Sandwalk readers will recall that this started with a press conference by NASA announcing that arsenic replaces phosphorus in the DNA of some bacteria. The announcement was treated with contempt by the blogosphere and eventually the claim was disproved by Rosie Redfield who showed that the experiment was flawed [The Arsenic Affair: No Arsenic in DNA!].

This was a case where the science was wrong and NASA should have known before it called a press conference. Ritchie goes on to document many cases where press releases have distorted the science in the actual publication. He doesn't mention the most egregious example, the ENCODE publicity campaign that successfully convinced most scientists that junk DNA was dead [The 10th anniversary of the ENCODE publicity campaign fiasco].

I like what he says about "churnalism" ...

In an age of 'churnalism', where time-pressed journalists often simply repeat the content of press releases in their articles (science news reports are often worded virtually identically to a press release), scientists have a great deal of power, and a great deal of responsibility. The constraints of peer review, lax as they might be, aren't present at all when engaging with the media, and scientists' biases about the importance of their results can emerge unchecked. Frustratingly, once the hype bubble has been inflated by a press release, it's difficult to burst.

Press releases of all sorts are failing us but university press releases are the most disappointing because we expect universities to be credible sources of information. It's obvious that scientists have to accept the blame for deliberately distorting their findings but surely the information offices at universities are also at fault? I once suggested that every press release has to include a statement, signed by the scientists, saying that the press release accurately reports the results and conclusions that are in the published article and does not contain any additional information or speculation that has not passed peer review.

Let's look at a recent example where the scientists would not have been able to truthfully sign such a statement.

A group of scientists based largely at The University of Sheffield in Sheffield (UK) recently published a paper in Nature on DNA damage in the human genome. They noted that such damage occurs preferentially at promoters and enhancers and is associated with demethylation and transcription activation. They presented evidence that the genome can be partially protected by a protein called "NuMA." I'll show you the abstract below but for now that's all you need to know.

The University of Sheffield decided to promote itself by issuing a press release: Breaks in ‘junk’ DNA give scientists new insight into neurological disorders. This title is a bit of a surprise since the paper only talks about breaks in enhancers and promoters and the word "junk" doesn't appear anywhere in the published report in Nature.

The first paragraph of the press release isn't very helpful.

‘Junk’ DNA could unlock new treatments for neurological disorders as scientists discover how its breaks and repairs affect our protection against neurological disease.

What could this mean? Surely they don't mean to imply that enhancers and promoters are "junk DNA"? That would be really, really, stupid. The rest of the press release should explain what they mean.

The groundbreaking research from the University of Sheffield’s Neuroscience Institute and Healthy Lifespan Institute gives important new insights into so-called junk DNA—or DNA previously thought to be non-essential to the coding of our genome—and how it impacts on neurological disorders such as Motor Neurone Disease (MND) and Alzheimer’s.

Until now, the body’s repair of junk DNA, which can make up 98 per cent of DNA, has been largely overlooked by scientists, but the new study published in Nature found it is much more vulnerable to breaks from oxidative genomic damage than previously thought. This has vital implications on the development of neurological disorders.

Oops! Apparently, they really are that stupid. The scientists who did this work seem to think that 98% of our genome is junk and that includes all the regulatory sequences. It seems like they are completely unaware of decades of work on discovering the function of these regulatory sequences. According to The University of Sheffield, these regulatory sequences have been "largely overlooked by scientists." That will come as a big surprise to many of my colleagues who worked on gene regulation in the 1980s and in all the decades since then. It will probably also be a surprise to biochemistry and molecular biology undergraduates at Sheffield; at least I hope it will be a surprise.

Professor Sherif El-Khamisy, Chair in Molecular Medicine at the University of Sheffield, Co-founder and Deputy Director of the Healthy Lifespan Institute, said: “Until now the repair of what people thought is junk DNA has been mostly overlooked, but our study has shown it may have vital implications on the onset and progression of neurological disease.”

I wonder if Professor Sherif El-Khamisy can name a single credible scientist who thinks that regulatory sequences are junk DNA?

There's no excuse for propagating this kind of misinformation about junk DNA. It's completely unnecessary and serves only to discredit the university and its scientists.

Ray, S., Abugable, A.A., Parker, J., Liversidge, K., Palminha, N.M., Liao, C., Acosta-Martin, A.E., Souza, C.D.S., Jurga, M., Sudbery, I. and El-Khamisy, S.F. (2022) A mechanism for oxidative damage repair at gene regulatory elements. Nature, 609:1038-1047. [doi: 10.1038/s41586-022-05217-8]

Oxidative genome damage is an unavoidable consequence of cellular metabolism. It arises at gene regulatory elements by epigenetic demethylation during transcriptional activation. Here we show that promoters are protected from oxidative damage via a process mediated by the nuclear mitotic apparatus protein NuMA (also known as NUMA1). NuMA exhibits genomic occupancy approximately 100 bp around transcription start sites. It binds the initiating form of RNA polymerase II, pause-release factors and single-strand break repair (SSBR) components such as TDP1. The binding is increased on chromatin following oxidative damage, and TDP1 enrichment at damaged chromatin is facilitated by NuMA. Depletion of NuMA increases oxidative damage at promoters. NuMA promotes transcription by limiting the polyADP-ribosylation of RNA polymerase II, increasing its availability and release from pausing at promoters. Metabolic labelling of nascent RNA identifies genes that depend on NuMA for transcription including immediate–early response genes. Complementation of NuMA-deficient cells with a mutant that mediates binding to SSBR, or a mitotic separation-of-function mutant, restores SSBR defects. These findings underscore the importance of oxidative DNA damage repair at gene regulatory elements and describe a process that fulfils this function.


Monday, September 05, 2022

The 10th anniversary of the ENCODE publicity campaign fiasco

On Sept. 5, 2012 ENCODE researchers, in collaboration with the science journal Nature, launched a massive publicity campaign to convince the world that junk DNA was dead. We are still dealing with the fallout from that disaster.

The Encyclopedia of DNA Elements (ENCODE) was originally set up to discover all of the functional elements in the human genome. They carried out a massive number of experiments involving a huge group of researchers from many different countries. The results of this work were published in a series of papers in the September 6th, 2012 issue of Nature. (The papers appeared on Sept. 5th.)

Sunday, September 04, 2022

Wikipedia: the ENCODE article

The ENCODE article on Wikipedia is a pretty good example of how to write a science article. Unfortunately, there are a few issues that will be very difficult to fix.

When Wikipedia was formed twenty years ago, there were many people who were skeptical about the concept of a free crowdsourced encyclopedia. Most people understood that a reliable source of information was needed for the internet because the traditional encyclopedias were too expensive, but could it be done by relying on volunteers to write articles that could be trusted?

The answer is mostly “yes” although that comes with some qualifications. Many science articles are not good; they contain inaccurate and misleading information and often don’t represent the scientific consensus. They also tend to be disjointed and unreadable. On the other hand, many non-science articles are at least as good as, and often better than, anything in the traditional encyclopedias (e.g. Battle of Waterloo; Toronto, Ontario; The Beach Boys).

By 2008, Wikipedia had expanded enormously and the quality of articles was being compared favorably to those of Encyclopedia Britannica, which had been forced to go online to compete. However, this comparison is a bit unfair since it downplays science articles.

Monday, August 29, 2022

The creationist view of junk DNA

Here's a recent video podcast (Aug. 23, 2022) from the Institute for Creation Research (sic). It features an interview with Dr. Jeff Tomkins of the ICR where he explains the history of junk DNA and why scientists no longer believe in junk DNA.

Most Sandwalk readers will recognize all the lies and distortions but here's the problem: I suspect that the majority of biologists would pretty much agree with the creationist interpretation. They also believe that junk DNA has been refuted and most of our genome is functional.

That's very sad.


Friday, August 26, 2022

ENCODE and their current definition of "function"

ENCODE has mostly abandoned its definition of function based on biochemical activity and replaced it with "candidate" function or "likely" function, but the message isn't getting out.

Back in 2012, the ENCODE Consortium announced that 80% of the human genome was functional and junk DNA was dead [What did the ENCODE Consortium say in 2012?]. This claim was widely disputed, causing the ENCODE Consortium leaders to back down in 2014 and restate their goal (Kellis et al. 2014). The new goal is merely to map all the potential functional elements.

... the Encyclopedia of DNA Elements Project [ENCODE] was launched to contribute maps of RNA transcripts, transcriptional regulator binding sites, and chromatin states in many cell types.

The new goal was repeated when the ENCODE III results were published in 2020, although you had to read carefully to recognize that they were no longer claiming to identify functional elements in the genome and they were raising no objections to junk DNA [ENCODE 3: A lesson in obfuscation and opaqueness].

Wednesday, August 24, 2022

Junk DNA vs noncoding DNA

The Wikipedia article on the Human genome contained a reference that I had not seen before.

"Finally DNA that is deleterious to the organism and is under negative selective pressure is called garbage DNA.[43]"

Reference 43 is a chapter in a book.

Pena S.D. (2021) "An Overview of the Human Genome: Coding DNA and Non-Coding DNA". In Haddad LA (ed.). Human Genome Structure, Function and Clinical Considerations. Cham: Springer Nature. pp. 5–7. ISBN 978-3-03-073151-9.

Sérgio Danilo Junho Pena is a human geneticist and professor in the Dept. of Biochemistry and Immunology at the Federal University of Minas Gerais in Belo Horizonte, Brazil. He is a member of the Human Genome Organization council. If you click on the Wikipedia link, it takes you to an excerpt from the book where S.D.J. Pena discusses "Coding and Non-coding DNA."

There are two quotations from that chapter that caught my eye. The first one is,

"Less than 2% of the human genome corresponds to protein-coding genes. The functional role of the remaining 98%, apart from repetitive sequences (constitutive heterochromatin) that appear to have a structural role in the chromosome, is a matter of controversy. Evolutionary evidence suggests that this noncoding DNA has no function—hence the common name of 'junk DNA.'"

Professor Pena then goes on to discuss the ENCODE results pointing out that there are many scientists who disagree with the conclusion that 80% of our genome is functional. He then says,

"Many evolutionary biologists have stuck to their guns in defense of the traditional and evolutionary view that non-coding DNA is 'junk DNA.'"

This is immediately followed by a quote from Dan Graur, implying that he (Graur) is one of the evolutionary biologists who defend the evolutionary view that noncoding DNA is junk.

I'm very interested in tracking down the reason for equating noncoding DNA and junk DNA, especially in contexts where the claim is obviously wrong. So I wrote to Professor Pena (he got his Ph.D. in Canada) and asked him for a primary source that supports the claim that "evolutionary evidence suggests that this noncoding DNA has no function."

He was kind enough to reply saying that there are multiple sources and he sent me links to two of them. Here's the first one.

I explained that this was somewhat ironic since I had written most of the Wikipedia article on Non-coding DNA and my goal was to refute the idea that noncoding DNA and junk DNA were synonyms. I explained that under the section on 'junk DNA' he would see the following statement that I inserted after writing sections on all those functional noncoding DNA elements.

"Junk DNA is often confused with non-coding DNA[48] but, as documented above, there are substantial fractions of non-coding DNA that have well-defined functions such as regulation, non-coding genes, origins of replication, telomeres, centromeres, and chromatin organizing sites (SARs)."

That's intended to dispel the notion that proponents of junk DNA ever equated noncoding DNA and junk DNA. I suggested that he couldn't use that source as support for his statement.

Here's my response to his second source.

The second reference is to a 2007 article by Wojciech Makalowski,1 a prominent opponent of junk DNA. He says, "In 1972 the late geneticist Susumu Ohno coined the term "junk DNA" to describe all noncoding sections of a genome" but that is a demonstrably false statement in two respects.

First, Ohno did not coin the term "junk DNA" - it was commonly used in discussions about genomes and even appeared in print many years before Ohno's paper. Second, Ohno specifically addresses regulatory sequences in his paper so it's clear that he knew about functional noncoding DNA that was not junk. He also mentions centromeres and I think it's safe to assume that he knew about ribosomal RNA genes and tRNA genes.

The only possible conclusion is that Makalowski is wrong on two counts.

I then asked about the second statement in Professor Pena's article and suggested that it might have been much better to say, "Many evolutionary biologists have stuck to their guns and defend the view that most of the human genome is junk." He agreed.

So, what have we learned? Professor Pena is a well-respected scientist and an expert on the human genome. He is on the council of the Human Genome Organization. Yet, he propagated the common myth that noncoding DNA is junk and saw nothing wrong with Makalowski's false reference to Susumu Ohno. Professor Pena himself must be well aware of functional noncoding elements such as regulatory sequences and noncoding genes so it's difficult to explain why he would imagine that prominent defenders of junk DNA don't know this.

I think the explanation is that this connection between noncoding DNA and junk DNA is so entrenched in the popular and scientific literature that it is just repeated as a meme without ever considering whether it makes sense.


1. The pdf appears to be a response to a query in Scientific American on February 12, 2007. It may be connected to a Scientific American paper by Khajavinia and Makalowski (2007).

Khajavinia, A., and Makalowski, W. (2007) What is "junk" DNA, and what is it worth? Scientific American, 296:104. [PubMed]

Monday, August 22, 2022

Is every gene associated with cancer?

There has been an enormous expansion of papers on cancer and many of them make a connection with a particular human gene. A recent note in Trends in Genetics revealed that 15,233 human genes have already been mentioned in a cancer paper (de Magalhães, 2022). (I'm pretty sure the author is only referring to protein-coding genes.)

The author notes that this association doesn't necessarily mean that there's a cause-and-effect relationship and also notes that justifying a connection between your favorite gene and a cancer grant application is a factor. However, he concludes that,

In genetics and genomics, literally everything is associated with cancer. If a gene has not been associated with cancer yet, it probably means it has not been studied enough and will most likely be associated with cancer in the future. In a scientific world where everything and every gene can be associated with cancer, the challenge is determining which are the key drivers of cancer and more promising therapeutic targets.

I think he's on to something. I predict that all noncoding genes will eventually be associated with cancer as well. Not only that, I predict that several thousand fake genes will also be associated with cancer. It won't be long before there are 100,000 human genes associated with cancer and then the remaining parts of the genome will also be mentioned in cancer papers.

This will mean the end of junk DNA because anything that causes cancer must be part of a functional genome.

I hope my book comes out before this becomes widely known.


de Magalhães, J.P. (2022) Every gene can (and possibly will) be associated with cancer. Trends in Genetics, 38:216-217. [doi: 10.1016/j.tig.2021.09.005]

Saturday, August 20, 2022

Editing the 'Intergenic region' article on Wikipedia

Just before getting banned from Wikipedia, I was about to deal with a claim on the Intergenic region article. I had already fixed most of the other problems but there is still this statement in the subsection labeled "Properties."

According to the ENCODE project's study of the human genome, due to "both the expansion of genic regions by the discovery of new isoforms and the identification of novel intergenic transcripts, there has been a marked increase in the number of intergenic regions (from 32,481 to 60,250) due to their fragmentation and a decrease in their lengths (from 14,170 bp to 3,949 bp median length)"[2]

The source is one of the ENCODE papers published in the September 6 edition of Nature (Djebali et al., 2012). The quotation is accurate. Here's the full quotation.

As a consequence of both the expansion of genic regions by the discovery of new isoforms and the identification of novel intergenic transcripts, there has been a marked increase in the number of intergenic regions (from 32,481 to 60,250) due to their fragmentation and a decrease in their lengths (from 14,170 bp to 3,949 bp median length).

What's interesting about that data is what it reveals about the percentage of the genome devoted to intergenic DNA and the percentage devoted to genes. The authors claim that there are 60,250 intergenic regions, which means that there must be more than 60,000 genes.1 The median length of these intergenic regions is 3,949 bp and that means that roughly 204.5 x 10^6 bp are found in intergenic DNA. That's roughly 7% of the genome depending on which genome size you use. It doesn't mean that all the rest is genes but it sounds like they're saying that about 90% of the genome is occupied by genes.
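The estimate can be reproduced with a short Python sketch. It follows the same simplification as the paragraph above by treating the median length as if it were the typical length of every intergenic region, which is only an approximation. The two genome sizes (3.1 and 3.2 billion bp) and the function names are assumptions chosen for illustration; they are not taken from the ENCODE paper.

```python
# Rough share of the genome in intergenic DNA, using the ENCODE figures
# quoted above. Treating the median as a typical length is an approximation.

INTERGENIC_REGIONS = 60_250   # number of intergenic regions (Djebali et al., 2012)
MEDIAN_LENGTH_BP = 3_949      # median intergenic length in bp (same source)


def intergenic_total_bp(n_regions: int, typical_len_bp: int) -> int:
    """Total intergenic bp if every region had the typical length."""
    return n_regions * typical_len_bp


def percent_of_genome(bp: int, genome_bp: int) -> float:
    """Share of the genome, in percent, covered by `bp` base pairs."""
    return 100 * bp / genome_bp


if __name__ == "__main__":
    total = intergenic_total_bp(INTERGENIC_REGIONS, MEDIAN_LENGTH_BP)
    print(f"intergenic total: {total:,} bp")
    for genome_bp in (3_100_000_000, 3_200_000_000):  # assumed genome sizes
        pct = percent_of_genome(total, genome_bp)
        print(f"genome of {genome_bp:,} bp: {pct:.1f}% intergenic")
```

Multiplying the two quoted figures directly gives about 238 x 10^6 bp, a little above the number in the paragraph, but with either value the intergenic share stays in the 7-8% range for genome sizes like these, so the conclusion about roughly 90% of the genome being assigned to genes is unchanged.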

In case you doubt that's what they're saying, read the rest of the paragraph in the paper.

Concordantly, we observed an increased overlap of genic regions. As the determination of genic regions is currently defined by the cumulative lengths of the isoforms and their genetic association to phenotypic characteristics, the likely continued reduction in the lengths of intergenic regions will steadily lead to the overlap of most genes previously assumed to be distinct genetic loci. This supports and is consistent with earlier observations of a highly interleaved transcribed genome, but more importantly, prompts the reconsideration of the definition of a gene.

It sounds like they are anticipating a time when the discovery of more noncoding genes will eventually lead to a situation where the intergenic regions disappear and all genes will overlap.

Now, as most of you know, the ENCODE papers have been discredited and hardly any knowledgeable scientist thinks there are 60,000 genes that occupy 90% of the genome. But here's the problem. I probably couldn't delete that sentence from Wikipedia because it meets all the criteria of a reliable source (published in Nature by scientists from reputable universities). Recent experience tells me that the Wikipolice Wikipedia editors would have blocked me from deleting it.

The best I could do would be to balance the claim with one from another "reliable source" such as Piovesan et al. (2019), who list the total number of exons and introns and their average sizes, allowing you to calculate that protein-coding genes occupy about 35% of the genome. Other papers give slightly higher values for protein-coding genes.

It's hard to get a reliable source on the real number of noncoding genes and their average size but I estimate that there are about 5,000 such genes and, to be generous, that they could take up a few percent of the genome. I assume in my upcoming book that genes probably occupy about 45% of the genome because I'm trying to err on the side of function.

An article on Intergenic regions is not really the place to get into a discussion about the number of noncoding genes but in the absence of such a well-sourced explanation the audience will be left with the statement from Djebali et al. and that's extremely misleading. Thus, my preference would be to replace it with a link to some other article where the controversy can be explained, preferably a new article on junk DNA.2

I was going to say,

The total amount of intergenic DNA depends on the size of the genome, the number of genes, and the length of each gene. That can vary widely from species to species. The value for the human genome is controversial because there is no widespread agreement on the number of genes but it's almost certain that intergenic DNA takes up at least 40% of the genome.

I can't supply a specific reference for this statement so it would never have gotten past the Wikipolice Wikipedia editors. This is a problem that can't be solved because any serious attempt to fix it will probably lead to getting blocked on Wikipedia.

There is one other statement in that section in the article on Intergenic region.

Scientists have now artificially synthesized proteins from intergenic regions.[3]

I would have removed that statement because it's irrelevant. It does not contribute to understanding intergenic regions. It's undoubtedly one of those little factoids that someone has stumbled across and thinks it needs to be on Wikipedia.

Deletion of a statement like that would have met with fierce resistance from the Wikipedia editors because it is properly sourced. The reference is to a 2009 paper in the Journal of Biological Engineering: "Synthesizing non-natural parts from natural genomic template."


1. There are no intergenic regions between the last genes on the end of a chromosome and the telomeres.

2. The Wikipedia editors deleted the Junk DNA article about ten years ago on the grounds that junk DNA had been disproven.

Djebali, S., Davis, C. A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A. et al. (2012) Landscape of transcription in human cells. Nature 489:101-108. [doi: 10.1038/nature11233]

Piovesan, A., Antonaros, F., Vitale, L., Strippoli, P., Pelleri, M. C., and Caracausi, M. (2019) Human protein-coding genes and gene feature statistics in 2019. BMC Research Notes 12:315. [doi: 10.1186/s13104-019-4343-8]

Thursday, August 18, 2022

The trouble with Wikipedia

I used to think that Wikipedia was a pretty good source of information even for scientific subjects. It wasn't perfect, but most of the articles could be fixed.

I was wrong. It took me more than two months to make the article on Non-coding DNA acceptable and my changes met with considerable resistance. Along the way, I learned that the old article on Junk DNA had been deleted ten years ago because the general scientific consensus was that junk DNA doesn't exist. So I started to work on a new "Junk DNA" article only to discover that it was going to be very difficult to get it approved. The powerful cult of experienced Wikipedia editors were clearly going to withhold approval of a new article on that subject.

I tried editing some other articles in order to correct misinformation but I ran into the same kind of resistance [see Allele, Gene, Human genome, Evolution, Alternative splicing, Intron]. Frequently, strange editors pop out of the woodwork to restore (revert) my attempts on the grounds that I was refuting well-sourced information. I even had one editor named tgeorgescu tell me that, "Friend, Wikipedians aren’t interested in what you know. They are interested in what you can cite, i.e. reliable sources."

How can you tell which sources are reliable unless you know something about the subject?

Much of this bad behavior is covered in a Wikipedia article on Why Wikipedia is not so great. Here's the part that concerns me the most.

People revert edits without explaining themselves (Example: an edit on Economics) (a proper explanation usually works better on the talk page than in an edit summary). Then, when somebody reverts, also without an explanation, an edit war often results. There's not enough grounding in Wikiquette to explain that reverts without comments are inconsiderate and almost never justified except for spam and simple vandalism, and even in those cases comments need to be made for tracking purposes.

There's a culture of hostility and conflict rather than of good will and cooperation. Even experienced Wikipedians fail to assume good faith in their collaborators. It seems fighting off perceived intruders and making egotistical reversions are a higher priority than incorporating helpful collaborators into Wikipedia's community. Glaring errors and omissions are completely ignored by veteran Wikiholics (many of whom pose as scientists, for example, but have no verifiable credentials) who have nothing to contribute but egotistical reverts.

In another article on Criticism of Wikipedia the contributors raise a number of issues including the bad behavior of the cult of long-time Wikipedia editors. It also points out that allowing anonymous editors who refuse to reveal their identity and areas of expertise leads to a lack of accountability.

This sort of behavior is frustrating and it has an effect. Well-meaning scientists are quickly discouraged from fixing articles because of all the hassle they have to go through.

I now see that the problem can't be easily fixed and Wikipedia science articles are not reliable.


Friday, August 12, 2022

The surprising (?) conservation of noncoding DNA

We've known for more than half-a-century that a lot of noncoding DNA is functional. Why are some people still surprised? It's a puzzlement.

A paper in Trends in Genetics caught my eye as I was looking for something else. The authors review the various functions of noncoding DNA such as regulatory sequences and noncoding genes. There's nothing wrong with that but the context is a bit shocking for a paper that was published in 2021 in a highly respected journal.

Leypold, N.A. and Speicher, M.R. (2021) Evolutionary conservation in noncoding genomic regions. TRENDS in Genetics 37:903-918. [doi: 10.1016/j.tig.2021.06.007]

Humans may share more genomic commonalities with other species than previously thought. According to current estimates, ~5% of the human genome is functionally constrained, which is a much larger fraction than the ~1.5% occupied by annotated protein-coding genes. Hence, ~3.5% of the human genome comprises likely functional conserved noncoding elements (CNEs) preserved among organisms, whose common ancestors existed throughout hundreds of millions of years of evolution. As whole-genome sequencing emerges as a standard procedure in genetic analyses, interpretation of variations in CNEs, including the elucidation of mechanistic and functional roles, becomes a necessity. Here, we discuss the phenomenon of noncoding conservation via four dimensions (sequence, regulatory conservation, spatiotemporal expression, and structure) and the potential significance of CNEs in phenotype variation and disease.

Thursday, August 04, 2022

Identifying functional DNA (and junk) by purifying selection

Functional DNA is best defined as DNA that is currently under purifying selection. In other words, it can't be deleted without affecting the fitness of the individual. This is the "maintenance function" definition and it differs from the "causal role" and "selected effect" definitions [The Function Wars Part IX: Stefan Linquist on Causal Role vs Selected Effect].

It has always been difficult to determine whether a given sequence is under purifying selection so sequence conservation is often used as a proxy. This is perfectly justifiable since the two criteria are strongly correlated. As a general rule, sequences that are currently being maintained by selection are ancient enough to show evidence of conservation. The only exceptions are de novo sequences and sequences that have recently become expendable and these are rare.

Sunday, July 31, 2022

Junk DNA causes cancer

This is a story about misleading press releases. The spread of misinformation by press offices is a serious issue that needs to be addressed.

The Institute of Cancer Research in London (UK) published a press release on July 19, 2022 with the provocative title: ‘Junk’ DNA could lead to cancer by stopping copying of DNA. The first three sentences tell most of the story.

Scientists have found that non-coding ‘junk’ DNA, far from being harmless and inert, could potentially contribute to the development of cancer.

Their study has shown how non-coding DNA can get in the way of the replication and repair of our genome, potentially allowing mutations to accumulate.

It has been previously found that non-coding or repetitive patterns of DNA – which make up around half of our genome – could disrupt the replication of the genome.

Nobody ever said that junk DNA was "inert and harmless;" in fact it is assumed to be slightly deleterious and only gets fixed because it is invisible to natural selection in small populations (Nearly Neutral Theory). And no intelligent scientist equates noncoding DNA and junk DNA, even by implication. But in any case, this article isn't about all junk DNA, it's about certain small stretches of repetitive DNA that interfere with replication so that the resulting mutations have to be fixed by repair mechanisms. The most likely sequences to interfere with replication are repeats of CG or (CG)n repeats. As the authors point out in the discussion, these repeats are "extremely rare" in all genomes, including the human genome, suggesting that they are under negative selection.

Other, more common, repeats also show detectable in vitro interference with replisomes at replication forks. The errors introduced by replication stalling can be repaired but some of them will escape repair causing mutations. It's not clear to me why mutations in junk DNA are a problem. That's not explained in the paper.

Here's the paper.

Casas-Delucchi, C.S., Daza-Martin, M., Williams, S.L. et al. (2022) Mechanism of replication stalling and recovery within repetitive DNA. Nat Commun 13:3953 [doi: 10.1038/s41467-022-31657-x]

Accurate chromosomal DNA replication is essential to maintain genomic stability. Genetic evidence suggests that certain repetitive sequences impair replication, yet the underlying mechanism is poorly defined. Replication could be directly inhibited by the DNA template or indirectly, for example by DNA-bound proteins. Here, we reconstitute replication of mono-, di- and trinucleotide repeats in vitro using eukaryotic replisomes assembled from purified proteins. We find that structure-prone repeats are sufficient to impair replication. Whilst template unwinding is unaffected, leading strand synthesis is inhibited, leading to fork uncoupling. Synthesis through hairpin-forming repeats is rescued by replisome-intrinsic mechanisms, whereas synthesis of quadruplex-forming repeats requires an extrinsic accessory helicase. DNA-induced fork stalling is mechanistically similar to that induced by leading strand DNA lesions, highlighting structure-prone repeats as an important potential source of replication stress. Thus, we propose that our understanding of the cellular response to replication stress may also be applied to DNA-induced replication stalling.

The word "junk" does not appear anywhere in the paper and the word "cancer" appears only once in the text where it refers to a "cancer-associated" mutation in yeast. This makes me wonder why the press release uses both of these words so prominently. Does anybody have any ideas?

Perhaps it has something to do with a quotation from Gideon Coster, who is described as the study leader. He says,

We wanted to understand why it seems more difficult for cells to copy repetitive DNA sequences than other parts of the genome. Our study suggests that so-called junk DNA is actually playing an important and potentially damaging role in cells, by blocking DNA replication and potentially opening the door to cancerous mutations.

I find it strange that he refers to "so-called junk DNA" in the press release but didn't mention it in the peer-reviewed paper. He also didn't emphasize cancerous mutations in the paper.

The press release contains another quotation, this time from Kristian Helin who is the Chief Executive of The Institute of Cancer Research. He says,

This study helps to unravel the puzzle of junk DNA – showing how these repetitive sequences can block DNA replication and repair. It’s possible that this mechanism could play a role in the development of cancer as a cause of genetic instability – especially as cancer cells start dividing more quickly and so place the process of DNA replication under more stress.

It's unclear to me how studying these mutation-inducing repeats could help "unravel the puzzle of junk DNA" but that's probably why I'm not the chief executive of a cancer research institute. I'm so stupid that I didn't even know there WAS a "puzzle" of junk DNA to be unravelled!

It's time for scientists to speak out against press releases like this one. It misrepresents the results and their interpretation as published after undergoing peer review. Instead, the press release is used as a propaganda exercise to promote the personal views of the scientists, views that they couldn't publish. This is what happened with ENCODE and it's becoming more and more common. The fact that, in this case, the personal views of these scientists are flawed only makes the situation worse.


Saturday, July 30, 2022

Wikipedia blocks any mention of junk DNA in the "Human genome" article

Wikipedia has an article on the Human genome. The introduction includes the following statement,

Human genomes include both protein-coding DNA genes and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly-repetitive sequences.

This is a recent improvement (July 22, 2022) over the original statement that simply said, "Human genomes include both protein-coding DNA genes and noncoding DNA." I noted in the "talk" section that there was no mention of junk DNA in the entire article on the human genome so I added a sentence to the end of the section quoted above. I said,

Some non-coding DNA is junk, such as pseudogenes, but there is no firm consensus over the total amount of junk DNA.1

Thursday, July 28, 2022

Kat Arney defends junk DNA

I'm a big fan of Kat Arney and I loved her 2016 book Herding Hemingway's Cats where she interviews a number of prominent scientists. If you haven't read it you should try and get a copy even if it's just to read the chapters on Mark Ptashne, Dan Graur, and Adrian Bird. The last chapter begins with an attempt to interview Evelyn Fox Keller but don't be put off by that because the rest of the chapter is very scientific.

Kat Arney gets mentioned a couple of times in my book and I quote her opinion of epigenetics from the chapter on Adrian Bird. She has a much better understanding of genes, genomes, and junk DNA than every other person who's ever written a book on those subjects. I especially like what she has to say about her journey of discovery on page 259 near the end of the book.

Things that I thought were solid fact have been exposed as dogma and scientific hearsay, based on little evidence but repeated enough times by researchers, textbooks, and journalists until they feel real.
Kat Arney (2016)

Kat Arney has just (July 28, 2022) posted a Genetics Society podcast on Genetics Unzipped. The main title is Does size matter when it comes to your genes and the subsections are "Where have all the genes gone?" "Genes or junk?" and "Are you more special than an onion?" You can listen to the podcast on that site (24 minutes) or read the entire transcript.

I don't entirely agree with everything she says in the podcast but she should be applauded for defending junk DNA in the face of all the scientific hearsay that's out there. Good for her.

Here are three things that I might have said differently.

  • I don't agree with her historical account of estimates of the number of genes in the human genome [False History and the Number of Genes 2010]. The knowledgeable experts in the field were predicting about 30,000 genes and their estimates weren't far off. The figure below is from Hatje et al. (2019). Note the anomalous estimates from the GeneSweep lottery and the EST data. The EST data were known to be highly suspect. This is important because the false narrative promotes the idea that scientists knew very little about the human genome before the sequence was published and it promotes the idea that there's some great mystery (too few genes) that needs to be solved.
  • I disagree with her statement that "actual genes makes up less than 2% of all the DNA in the whole human genome." My disagreement depends somewhat on the definition of a gene but that's not really controversial. We're talking about the molecular gene and that's defined as "A gene is a DNA sequence that is transcribed to produce a functional product" [What Is a Gene?]. There are exceptions but this is the best definition we have. The fact that a great many scientists are confused about this is no excuse. Genes include introns so the typical human gene is quite large. In fact, about 45% of the human genome is devoted to genes. This is a far cry from the small percentage (<2%) that consists only of coding regions.
  • Kat Arney says, "So, given that most of our genome isn’t actually genes, what does the rest of it do? Well, it’s complicated, and there’s still a lot we don’t know." My quibble here is subtle but I think it's important. I think we have a pretty good handle on the functional parts of our genome and I don't expect any surprises. We know that about 10% of our genome is conserved and we can account for most of that functional DNA. The rest is not a mystery. We know that most of it consists of various flotsam and jetsam related to transposons and things like pseudogenes and dead viruses. This is junk DNA by any definition and we should stop pretending that it's a big mystery. When we say that 90% of our genome is junk that's not a reflection of ignorance; it's an evidence-based conclusion.

Hatje, K., Mühlhausen, S., Simm, D., and Kollmar, M. (2019) The Protein-Coding Human Genome: Annotating High-Hanging Fruits. BioEssays, 0(0), 1900066. [doi: 10.1002/bies.201900066]

Sunday, July 17, 2022

The Function Wars Part XIII: Ford Doolittle writes about transposons and levels of selection

It's theoretically possible that the presence of abundant transposon fragments in a genome could provide a clade with a selective advantage at the level of species sorting. Is this an important contribution to the junk DNA debate?

As I explained in Function Wars Part IX, we need to maintain a certain perspective in these debates over function. The big picture view is that 90% of the human genome qualifies as junk DNA by any reasonable criteria. There's lots of evidence to support that claim but in spite of the evidence it is not accepted by most scientists.

Most scientists think that junk DNA is almost an oxymoron since natural selection would have eliminated it by now. Many scientists think that most of our genome must be functional because it is transcribed and because it's full of transcription factor binding sites. My goal is to show that their lack of understanding of population genetics and basic biochemistry has led them astray. I am trying to correct misunderstandings and the false history of the field that have become prominent in the scientific literature.

For the most part, philosophers and their friends have a different goal. They are interested in epistemology and in defining exactly what you mean by 'function' and 'junk.' To some extent, this is nitpicking and it undermines my goal by lending support, however oblique, to opponents of junk DNA.1

As I've mentioned before, this is most obvious when it comes to the ENCODE publicity campaign of 2012 [see: Revising history and defending ENCODE]. The reason why the ENCODE researchers were wrong is that they didn't understand that many transcription factor binding sites are unimportant and they didn't understand that many transcripts could be accidental. These facts are explained in the best undergraduate textbooks and they were made clear to ENCODE researchers in 2007 when they published their preliminary results. They were wrong because they didn't understand basic biochemistry. [ENCODE 2007]

Some people are trying to excuse ENCODE on the grounds that they simply picked an inappropriate definition of function. In other words, ENCODE made an epistemology error not a stupid biochemistry mistake. Here's another example from a new paper by Ford Doolittle in Biology and Philosophy. He says,

However, almost all of these developments in evolutionary biology and philosophy passed molecular genetics and genomics by, so that publicizers of the ENCODE project’s results could claim in 2012 that 80.4% of the human genome is “functional” (Ecker et al 2012) without any well thought-out position on the meaning of ‘function’. The default assumption made by ENCODE investigators seemed to have been that detectable activities are almost always products of selection and that selection almost always serves the survival and reproductive interests of organisms. But what ENCODE interpreted as functionality was unclear—from a philosophical perspective. Charitably, ENCODE’s principle mistake could have been a too broad and level-ignorant reading of selected effect (SE) “function” (Garson 2021) rather than the conflation of SE and causal role (CR) definitions of “the F-word”, as it is often seen as being (Doolittle and Brunet 2017).

My position is that this is far too "charitable." ENCODE's mistake was not in using the wrong definition of function; their mistake was in assuming that all transcripts and all transcription factor binding sites were functional in any way. That was a stupid assumption and they should have known better. They should have learned from the criticism they got in 2007.

This is only a small part of Doolittle's paper but I wanted to get that off my chest before delving into the main points. I find it extremely annoying that there's so much ink and electrons being wasted on the function wars when the really important issues are a lack of understanding of population genetics and basic biochemistry. I fear that the function wars are contributing to the existing confusion rather than clarifying it.

Doolittle, F. (2022) All about levels: transposable elements as selfish DNAs and drivers of evolution. Biology & Philosophy 37: article number 24 [doi: 10.1007/s10539-022-09852-3]

The origin and prevalence of transposable elements (TEs) may best be understood as resulting from “selfish” evolutionary processes at the within-genome level, with relevant populations being all members of the same TE family or all potentially mobile DNAs in a species. But the maintenance of families of TEs as evolutionary drivers, if taken as a consequence of selection, might be better understood as a consequence of selection at the level of species or higher, with the relevant populations being species or ecosystems varying in their possession of TEs. In 2015, Brunet and Doolittle (Genome Biol Evol 7: 2445–2457) made the case for legitimizing (though not proving) claims for an evolutionary role for TEs by recasting such claims as being about species selection. Here I further develop this “how possibly” argument. I note that with a forgivingly broad construal of evolution by natural selection (ENS) we might come to appreciate many aspects of Life on earth as its products, and TEs as—possibly—contributors to the success of Life by selection at several levels of a biological hierarchy. Thinking broadly makes this proposition a testable (albeit extraordinarily difficult-to-test) Darwinian one.

The essence of Ford's argument builds on the idea that active transposable elements (TEs) are examples of selfish DNA that propagate in the genome. This is selection at the level of DNA. Other elements of the genome, such as genes, regulatory sequences, and origins of replication, are examples of selection at the level of the organism and individuals within a population. Ford points out that some transposon-related sequences might be co-opted to form functional regions of the genome that are under purifying selection at the level of organisms and populations. He then goes on to argue that species with large amounts of transposon-related sequences in their genomes might have an evolutionary advantage because they have more raw material to work with in evolving new functions. If this is true, then this would be an example of species level selection.

These points are summarized near the end of his paper.

Thus TE families, originating and establishing themselves abundantly within a species through selection at their own level may wind up as a few relics retained by purifying selection at the level of organisms. Moreover, if this contribution to the formation of useful relics facilitated the diversification of species or the persistence of clades, then we might also say that these TE families were once “drivers” of evolution at these higher levels, and that their possession was once an adaptation at each such higher level.

There are lots of details that we could get into later but I want to deal with the main speculation; namely, that species with lots of TE fragments in their genome might have an adaptive advantage over species that don't.

This is a challenging topic because lots of people have expressed their opinions on many of the topics that Ford covers in his article. None of their opinions are identical and many of them are based on different assumptions about things like evolvability, teleology, the significance of the problem, how to define species sorting, and whether hierarchy theory is important. Many of those people are very smart (as is Ford Doolittle) and it hurts my brain trying to figure out who is correct. I'll try and explain some of the issues and the controversies.

A solution in search of a problem?

What's the reason for speculating that abundant bits of junk DNA might be selected because they will benefit the species at some time in the next ten million years or so? Is there a problem that this speculation explains?

The standard practice in science is to suggest hypotheses that account for an unexplained observation; for example, the idea of abundant junk DNA explained the C-value Paradox and the mutation load problem. Models are supposed to have explanatory power—they are supposed to explain something that we don't understand.

Ford thinks there is a reason for retaining junk DNA. He writes,

Eukaryotes are but one of the many clades emerging from the prokaryotic divergence. Although such beliefs may be impossible to support empirically it is widely held that that was a special and evolutionarily important event....

Assuming this to be true (but see Booth and Doolittle 2015) we might ask if there are reasons for this differential evolutionary success, and are these reasons clade-level properties that have been selected for at this high level? Is one of them the possession of large and variable families of TEs?

You'll have to read his entire paper to see his full explanation but this is the important part. Ford thinks that the diversity and success of eukaryotes requires an explanation because it can't be accounted for by standard evolutionary theory. I don't see the problem so I don't see the need for an explanation.

Of course there doesn't have to be a scientific problem that needs solving. This could just be a theoretical argument showing that excess DNA could lead to species level selection. That puts it more in the realm of philosophy and Ford does make the point in his paper that one of his goals is simply to defend multilevel selection theory (MLST) as a distinct possibility. The main proponents of this idea (Hierarchy Theory) are Niles Eldredge and Stephen Jay Gould and the theory is thoroughly covered in Gould's book The Structure of Evolutionary Theory. I was surprised to discover that this book isn't mentioned in the Doolittle paper.

I don't have a problem with Hierarchy Theory (or Multilevel Selection Theory, or group selection) as a theoretical possibility. The important question, as far as I'm concerned, is whether there's any evidence to support species selection. As Ford notes, "such beliefs may be impossible to support empirically" and that may be true; however, there's a danger in promoting ideas that have no empirical support because that opens a huge can of worms that less rigorous scientists are eager to exploit.

With respect to the role of transposon-related sequences, the important question, in my opinion, is: Would life look substantially less diverse or less complex if no transposon-related sequences had ever been exapted to form elements that are now under purifying selection? I suspect that the answer is no—life would be different but no less diverse or complex.

Species selection vs species sorting

Speculations about species-level evolution are usually discussed in the context of group selection and species selection or, more broadly, as the levels-of-selection debate. Those are the terms Doolittle uses and he is very much interested in explaining junk DNA as contributing to adaptation at the species level.

But if the insertion of [transcription factor binding sites] TFBSs helps species to innovate and thus diversify (speciate and/or forestall extinction) and is a consequence of TFBS-bearing TE carriage, then such carriage might be cast as an adaptation at the level of species and maintained at that level too, by the differential extinction of TE-deficient species (Linquist et al 2020; Brunet et al 2021).

I think it's unfortunate that we don't use the term 'species sorting' instead of 'species selection' because as soon as you restrict your discussion to selection, you are falling into the adaptationist trap. Elisabeth Vrba, backed by Niles Eldredge, preferred 'species sorting' partly in order to avoid this trap.

I am convinced, on the basis of Vrba's analysis, that we naturalists have been saying 'species selection' when we really should have been calling the phenomenon 'species sorting.' Species sorting is extremely common, and underlies a great deal of evolutionary patterns, as I shall make clear in this narrative. On the other hand, true species selection, in its properly more restricted sense, I now believe to be relatively rare. (Niles Eldredge, in Reinventing Darwin (1995) p. 137)

As I understand it, the difference between 'species sorting' and 'species selection' is that the former term does not commit you to an adaptationist explanation.2 Take the Galapagos finches as an example. There has been fairly rapid radiation of these species from a small initial population that reached the islands. This radiation was not due to any intrinsic property of the finch genome that made finches more successful at speciation; it was just a lucky accident. Similarly, the fact that there are many marsupial species in Australia is probably not because the marsupial genome is better suited to evolution; it's probably just a founder effect at the species level.

Gould still prefers 'species selection' but he recognizes the problem. He points out that whenever you view species as evolving entities within a larger 'population' of other species, you must consider species drift as a distinct possibility. And this means that you can get evolution via a species-level founder effect that has nothing to do with adaptation.

Low population (number of species in a clade) provides the enabling criterion for important drift ... at the species level. The analogue of genetic drift—which I shall call 'species drift' must act both frequently and powerfully in macroevolution. Most clades do not contain large numbers of species. Therefore, trends may often originate for effectively random reasons. (Stephen J. Gould, in The Structure of Evolutionary Theory (2002) p. 736)

Let's speculate how this might relate to the current debate. It's possible that the apparent diversity and complexity of large multicellular eukaryotes is mostly due to the fact that they have small populations and long generation times. This means that there were plenty of opportunities for small isolated populations to evolve distinctive features. Thus, we have, for example, more than 1000 different species of bats because of species drift (not species selection). What this means is that the evolution of new species is due to the same reason (small populations) as the evolution of junk DNA. One phenomenon (junk DNA) didn't cause the other (speciation); instead, both phenomena have the same cause.

Michael Lynch has written about this several times, but the important, and mind-hurting, paper is Lynch (2007) where he says,

Under this view, the reductions in Ng that likely accompanied both the origin of eukaryotes and the emergence of the animal and land-plant lineages may have played pivotal roles in the origin of modular gene architectures on which further developmental complexity was built.

Lynch's point is that we should not rule out nonadaptive processes (species drift) in the evolution of complexity, modularity, and evolvability.

If we used species sorting instead of species selection, it would encourage a more pluralistic perspective and a wider variety of speculations. I don't mean to imply that this issue is ignored by Ford Doolittle, only that it doesn't get the attention it deserves.

Evolvability and teleology

Ford is invoking evolvability as the solution to the evolved complexity and diversity of multicellular eukaryotes. This is not a new idea: it is promoted by James Shapiro, by Mark Kirschner and John Gerhart, and by Günter Wagner, among others. (None of them are referenced in the Doolittle paper.)

The idea here is that clades with lots of TEs should be more successful than those with less junk DNA. It would be nice to have some data that address this question. For example, is the success of the bat clade due to more transposons than other mammals? Probably not, since bats have smaller genomes than other mammals. What about birds? There are lots of bird species but birds seem to have smaller genomes than some of their reptilian ancestors.

There are dozens of Drosophila species and they all have smaller genome sizes than many other flies. In this case, it looks like the small genome had an advantage in evolvability but that's not the prediction.

The concept of evolvability is so attractive that even a staunch gene-centric adaptationist like Richard Dawkins is willing to consider it (Dawkins, 1988). Gould devotes many pages (of course) to the subject in his big Structure book. Both Dawkins and Gould recognize that they are possibly running afoul of teleology in the sense of arguing that species have foresight. Here's how Dawkins puts it ...

It is all too easy for this kind of argument to be used loosely and unrespectably. Sydney Brenner justly ridiculed the idea of foresight in evolution, specifically the notion that a molecule, useless to a lineage of organisms in its own geological era, might nevertheless be retained in the gene pool because of its possible usefulness in some future era: "It might come in handy in the Cretaceous!" I hope I shall not be taken as saying anything like that. We certainly should have no truck with suggestions that individual animals might forego their selfish advantage because of possible long-term benefits to their species. Evolution has no foresight. But with hindsight, those evolutionary changes in embryology that look as though they were planned with foresight are the ones that dominate successful forms of life.

I interpret this to mean that we should not be fooled by hindsight into looking for causes when what we are seeing is historical contingency. If you have not already read Wonderful Life by Stephen Jay Gould then I highly recommend that you get a copy and read it now in order to understand the role of contingency in the evolution of animals. You should also brush up on the more recent contributions to the tape-of-life debate in order to put this discussion about evolvability into the proper context [Replaying life's tape].

Ford also recognizes the teleological problem and even quotes Sydney Brenner! Here's how Ford explains the relationship between transposon-related sequences and species selection.

As I argue here, organisms took on the burden of TEs not because TE accumulation, TE activity or TE diversity are selected-for traits within any species, serving some current or future need, but because lower-level (intragenomic) selection creates and proliferates TEs as selfish elements. But also, and just possibly, species in which this has happened speciate more often or last longer and (even more speculatively still) ecosystems including such species are better at surviving through time, and especially through the periodic mass extinctions to which this planet has been subjected (Brunet and Doolittle 2015). ‘More speculatively still’ because the adaptations at higher levels invoked are almost impossible to prove empirically. So what I present are again only ‘how possibly’, not ‘how actually’ arguments (Resnick 1991).

This is diving deeply into the domain of abstract thought that's not well-connected to scientific facts. As I mentioned above, I tend to look on these speculations as solutions looking for a problem. I would like to see more evidence that the properties of genomes endow certain species with more power to diversify than species with different genomic properties. Nevertheless, the idea of evolvability is not going away so let's see if Ford's view is reasonable.

As usual, Stephen Jay Gould has thought about this deeply and come up with some useful ideas. His argument is complicated but I'll try and explain it in simple terms. I'm relying mostly on the section called "Resolving the paradox of Evolvability and Defining the Exaptive Pool" in The Structure of Evolutionary Theory pages 1270-1295.

Gould argues that in Hierarchy Theory, the properties at each level of evolution must be restricted to that level. Thus, you can't have evolution at the level of DNA impinging on evolution at the level of the organism. For example, you can't have selection between transposons within a genome affecting evolution at the level of organisms and populations. Similarly, selection at the level of organisms can't directly affect species sorting.

What this means in terms of genomes full of transposon-related sequences is the following. Evolution at the level of species involves sorting (or selection) between different species or clades. Each of these species has different properties that may or may not make it more prone to speciation but those properties are equivalent to mutations, or variation, at the level of organisms. Some species may have lots of transposon sequences in their genome and some may have fewer and this difference arises just by chance as do mutations. There is no foresight in generating mutations and there is no foresight in having different sized genomes.

During species sorting, the differences may confer some selective advantage so species with, say, more junk DNA are more likely to speciate but the differences arose by chance in the same sense that mutations arise by chance (i.e. with no foresight). For example, in Lenski's long-term evolution experiment, certain neutral mutations became fixed by chance so that new mutations arising in this background became adaptive [Contingency, selection, and the long-term evolution experiment]. Scientists and philosophers aren't concerned about whether those neutral mutations might have arisen specifically in order to potentiate future evolution.

Similarly, it is inappropriate to say that transposons, or pervasive transcription, or splicing errors, arose BECAUSE they encouraged evolution at the species level. Instead, as Dawkins said, those features just look with hindsight as though they were planned. They are fortuitous accidents of evolution.

Gould also makes the point, again, that we could just as easily be looking at species drift as species selection and we have to be careful not to resort to adaptive just-so stories in the absence of evidence for selection.

Here's how Gould describes his view of evolvability using the term "spandrel" to describe potentiating accidents.

Thus, Darwinians have always argued that mutational raw material must be generated by a process other than organismal selection, and must be "random" (in the crucial sense of undirected towards adaptive states) with respect to realized pathways of evolutionary change. Traits that confer evolvability upon species-individuals, but arise by selection upon organisms, provide a precise analog at the species level to the classical role of mutation at the organismal level. Because these traits of species evolvability arise by a different process (organismal selection), unrelated to the selective needs of species, they may emerge at the species level as "random" raw material, potentially utilizable as traits for species selection.

The phenotypic effects of mutation are, in exactly the same manner, spandrels at the organismal level—that is, nonadaptive and automatic manifestations at a higher level of different kinds of causes acting directly at a lower level. The exaptation of a small and beneficial subset of these spandrels virtually defines the process of natural selection. Why else do we so commonly refer to the theory of natural selection as an interplay of "chance" (for the spandrels of raw material in mutational variation) and "necessity" (for the locally predictable directions of selection towards adaptation)? Similarly, species selection operates by exapting emergent spandrels from causal processes acting upon organisms.

This is a difficult concept to grasp so I urge interested readers to study the relevant chapter in Gould's book. The essence of his argument is that species sorting can only be understood at the level of species as individuals and the properties of species as the random variation upon which species sorting operates.

Michael Lynch is also skeptical about evolvability but for slightly different reasons (Lynch, 2007). Lynch is characteristically blunt about how he views anyone who disagrees with him. (I have been on the losing side of one of those disagreements and I still have the scars to prove it.)

Four of the major buzzwords in biology today are complexity, modularity, evolvability, and robustness, and it is often claimed that ill-defined mechanisms not previously appreciated by evolutionary biologists must be invoked to explain the existence of emergent properties that putatively enhance the long-term success of extant taxa. This stance is not very different from the intelligent-design philosophy of invoking unknown mechanisms to explain biodiversity.

This is harsh and somewhat unfair since nobody would accuse Ford Doolittle of ulterior motives. Lynch's point is that evolvability must be subjected to the same rigorous standards that he applies to population genetics. He questions the idea that "the ability to evolve itself is actively promoted by directional selection" and raises four objections.

  1. Evolvability doesn't meet the stringent conditions that a good hypothesis demands.
  2. It's not clear that the ability to evolve is necessarily advantageous.
  3. There's no evidence that differences between species are anything other than normal variation.
  4. "... comparative genomics provides no support for the idea that genome architectural changes have been promoted in multicellular lineages so as to enhance their ability to evolve."

Why transposon-related sequences?

One of the problems that occurred to me was why there was so much emphasis on transposon sequences. Don't the same arguments apply to pseudogenes, random duplications, and, especially, genome doublings? They do, but the paper appears to be part of a series that arose out of a 2018 meeting on Evolutionary Roles of Transposable Elements: The Science and Philosophy organized by Stefan Linquist and Ford Doolittle. That's why there's a focus on transposons. I assume that Ford could make the same case for other properties of large genomes such as pervasive transcription, spurious transcription factor binding sites, and splicing errors even if they had nothing to do with transposons.

Is this an attempt to justify junk?

I argue that genomes are sloppy and junk DNA accumulates just because it can. There's no ulterior motive in having a large genome full of junk and it's far more likely to be slightly deleterious than neutral. I believe that all the evidence points in that direction.

This is not a popular view. Most scientists want to believe that all of that excess DNA is there for a reason. If it doesn't have a direct functional role then, at the very least, it's preserved in the present because it allows for future evolution. The arguments promoted by Ford Doolittle in this article, and by others in related articles, tend to support those faulty views about the importance of junk DNA even though that wasn't the intent. Doolittle's case is much more sophisticated than the naive views of junk DNA opponents but, nevertheless, you can be sure that this paper will be referenced frequently by those opponents.

Normal evolution is hard enough but multilevel selection is even harder, especially for molecular biologists who would never think of reading The Structure of Evolutionary Theory, or any other book on evolution. That's why we have to be really careful to distinguish between effects that are adaptations for species sorting and effects that are fortuitous and irrelevant for higher level sorting.

Function Wars
(My personal view of the meaning of function is described at the end of Part V.)

1. The same issues about function come up in the debate over alternative splicing [Alternative splicing and evolution].

2. See Vrba and Gould (1986) for a detailed discussion of species sorting and species selection and how it pertains to the hierarchical perspective.

Dawkins, R. (1988) The Evolution of Evolvability. Artificial Life, The proceedings of an Interdisciplinary Workshop on The Synthesis and Simulation of Living Systems held September 1987 in Los Alamos, New Mexico. C. G. Langton, Addison-Wesley Publishing Company: 201-220.

Lynch, M. (2007) The frailty of adaptive hypotheses for the origins of organismal complexity. Proceedings of the National Academy of Sciences 104:8597-8604. [doi: 10.1073/pnas.0702207104]

Vrba, E.S. and Gould, S.J. (1986) The hierarchical expansion of sorting and selection: sorting and selection cannot be equated. Paleobiology 12:217-228. [doi: 10.1017/S0094837300013671]