Showing posts sorted by date for query ENCODE. Sort by relevance Show all posts

Sunday, May 10, 2026

Why do scientists at "elite" universities dominate scientific discourse?

We all know that scientists at elite universities publish a lot more papers than scientists at other universities. Why is that? Is it because those universities have better labs and equipment? Is it because the scientists at elite universities are smarter than other scientists? Is it because of the reputation of the universities that makes it easier to get papers accepted in the best journals?

A group of scientists at the University of Colorado (Boulder, Colorado, USA) decided to examine the question and they came up with another answer—one that I have long suspected.

Saturday, May 09, 2026

Pervasive transcription = genes + noise

Most of the DNA in the human genome is transcribed at some point in development or in some cell type. This fact has been known since the late 1960s.

There are basically two types of transcripts. Functional transcripts mostly come from genes although there might be a few exceptions (e.g. enhancer RNAs). Non-functional transcripts can be produced by pseudogenes or from virus and transposon fossils. They can also be due to transcriptional noise caused by spurious transcription.

Monday, April 27, 2026

Ask Gemini: "What is the difference between junk DNA and non-coding DNA?"

This is weird. I was a bit bored so I asked Gemini the following question: "What is the difference between junk DNA and non-coding DNA?" I thought the first answer was so wrong that I decided to ask it again to see if I got the same answer.

The second answer was quite different because Gemini noticed that I had bookmarked Sandwalk, a blog written by Laurence Moran, a champion of the 'junk DNA' concept. Is it trying to give me the answer it thinks I want or the best possible scientific answer?

Note: Here is the correct answer.

Non-coding DNA refers to the part of the genome that doesn't code for proteins. It's one way to partition the genome - you could also refer to regulatory sequences and non-regulatory sequences.

By the late 1960s scientists knew of lots of functional non-coding DNA such as regulatory sequences and non-coding genes such as those for ribosomal RNA and tRNA. (There are other non-coding functional elements.) It became apparent that most of the human genome consisted of non-functional DNA or junk DNA. The original model back then was that 10% is functional and 90% is junk. The 10% that is functional consisted of 1-2% coding DNA and about 8% of functional non-coding DNA.
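The partition described above can be checked with simple arithmetic. The following is a minimal sketch, assuming the upper end of the 1-2% range for coding DNA (2%) so that the fractions sum to the 10% functional figure given in the text; the variable names are only illustrative.

```python
# Rough genome partition from the text: ~10% functional, ~90% junk.
# Assumption: coding DNA is taken as 2% (the text gives a range of 1-2%).
coding = 0.02
functional_noncoding = 0.08
junk = 0.90

# Total functional fraction and total non-coding fraction of the genome
functional = coding + functional_noncoding
noncoding = 1.0 - coding

# How much of the non-coding DNA is functional, and how much is junk
functional_share_of_noncoding = functional_noncoding / noncoding
junk_share_of_noncoding = junk / noncoding

print(f"functional fraction of genome: {functional:.0%}")
print(f"non-coding fraction of genome: {noncoding:.0%}")
print(f"functional share of non-coding DNA: {functional_share_of_noncoding:.1%}")
print(f"junk share of non-coding DNA: {junk_share_of_noncoding:.1%}")
```

Under these assumptions, non-coding DNA and junk DNA are clearly different categories: a small but real share of the non-coding fraction is functional, so "non-coding" cannot be used as a synonym for "junk."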

No knowledgeable scientist ever said that all non-coding DNA was junk; that's a lie that continues to be perpetuated in scientific publications and the popular media even though it has been repeatedly debunked.

Most of the data that has accumulated over the past 50+ years has supported the idea that 90% of the human genome is junk and only 10% is functional.

The Gemini answers relate to the debate concerning whether AI is really intelligent and, more importantly, whether the popular (free) algorithms are spreading misinformation.

Tuesday, April 14, 2026

How many pseudogenes in the human genome?

There are somewhat fewer than 25,000 genes in the human genome and there are probably about the same number of pseudogenes.

Pseudogenes are sequences that resemble real functional genes but they contain mutations that render them non-functional. They are very real examples of junk DNA.

There are four kinds of pseudogenes. Duplicated pseudogenes arise from a gene duplication event when one of the original copies mutates. Duplicated pseudogenes retain all of the features of the original gene, including introns and adjacent regulatory sequences. The inactivating mutation may occur in the gene itself—for example in the coding region of a protein coding gene—in which case the pseudogene may still be transcribed. Duplicated pseudogenes are usually found adjacent to their parent gene.

Processed pseudogenes arise when the normal transcript is copied by reverse transcriptase and the DNA copy is reintegrated into the genome. Processed pseudogenes don't have introns or regulatory sequences and they are not near their parent gene. Most processed pseudogenes come from transcripts that are expressed in the germ line.

Monday, April 06, 2026

How can philosophy contribute to science?

I've written quite a bit about the perceived conflict between science and philosophy and defended my view that science is best described in broad terms as a way of knowing that requires evidence, skepticism, and rational thinking. As far as I know, there is no other way of knowing that has produced true knowledge.

In this sense, the proper practice of philosophy has to involve science (and by that I mean evidence) if the results are going to produce knowledge. There's lots to debate on this topic, including discussions about the meaning of "knowledge" [Is science the only way of knowing?].

But that's not what I want to talk about today. Today's topic is about the contribution that philosophers can make to science. I'll focus on philosophers of biology and on scientific topics that I'm knowledgeable about and I'll assume that most philosophers agree with Elisabeth Lloyd when she says, "As a philosopher of science, I have always been oriented towards addressing problems that scientists have, not so much problems that philosophers have. That is how to do good philosophy of science."1

Now, let me be clear about the issue. It is blindingly obvious that philosophers could use their deep understanding of logic and argumentation to make significant contributions to biology, especially in cases where scientists are misusing logic. The question is not whether philosophy is capable of contributing to biology but whether it is actually fulfilling that potential.

Monday, January 26, 2026

The Third Way Evolution Conference

The Third Way of Evolution is a strange organization composed of mavericks who think they're not getting enough attention. Here's how they describe their movement.

The vast majority of people believe that there are only two alternative ways to explain the origins of biological diversity. One way is Creationism that depends upon intervention by a divine Creator. That is clearly unscientific because it brings an arbitrary supernatural force into the evolution process. The commonly accepted alternative is Neo-Darwinism, which is clearly naturalistic science but ignores much contemporary molecular evidence and invokes a set of unsupported assumptions about the accidental nature of hereditary variation. Neo-Darwinism ignores important rapid evolutionary processes such as symbiogenesis, horizontal DNA transfer, action of mobile DNA and epigenetic modifications. Moreover, some Neo-Darwinists have elevated Natural Selection into a unique creative force that solves all the difficult evolutionary problems without a real empirical basis. Many scientists today see the need for a deeper and more complete exploration of all aspects of the evolutionary process.

Thursday, January 15, 2026

Even more regulatory elements?

The expression of genes is regulated at many levels but one of the most important is regulation at the level of transcription. Transcription initiation is controlled by transcription factors that bind to sequences near the promoter and either activate or repress transcription.

A lot of work has been done on transcription regulation in mammals over the past 40 years. The general impression from these detailed studies of individual genes is that regulation usually involves a relatively small number of transcription factors that bind to sequences within 1000 bp or so of the transcription start site.

This model was challenged by the ENCODE studies in 2012. ENCODE researchers claimed to have discovered hundreds of thousands of cis-regulatory elements (CREs) covering a substantial percentage of the genome. If they are correct, then this means that there are dozens of transcription factors controlling the expression of every gene.

Sunday, January 04, 2026

Will AlphaGenome from Google DeepMind help us understand the human genome?

I recently reported that Google's AI program does a horrible job of summarizing the junk DNA controversy. [The scary future of AI is revealed by how it deals with junk DNA] That led to a discussion about the "intelligence" in artificial intelligence and whether AI was capable of distinguishing between accurate and inaccurate data.

Google DeepMind is an artificial intelligence research laboratory headquartered in London, UK. Two of its programmers, Demis Hassabis and John Jumper, were awarded the 2024 Nobel Prize in Chemistry for developing AlphaFold, a program that predicts the tertiary structure of proteins.

Wednesday, December 31, 2025

The activity of "random" DNA supports the junk DNA model

I complain a lot about the quality of science writing but today's post is very different. I want to highlight an article by Michael Le Page that he just published in New Scientist. It's one of the best articles on junk DNA that I've ever seen in popular science magazines and newspapers [Human-plant hybrid cells reveal truth about dark DNA in our genome].

I've admired Michael Le Page for many years because of his articles on climate change and evolution. It doesn't surprise me that he's right about junk DNA.

Friday, December 19, 2025

How many lncRNA genes in the human genome? (2025)

There is considerable controversy over the total number of genes in the human genome. The number of protein-coding genes is pretty well established at somewhere between 19,500 and 20,000. It's the number of non-coding genes that's disputed.

There's general agreement on the number of well-defined small RNA genes such as snRNAs, snoRNAs, microRNAs etc. Similarly, the number of ribosomal RNA and tRNA genes is known. The problem is with identifying genuine long non-coding RNA genes (lncRNA genes). Estimates vary from fewer than 20,000 to more than 200,000 but most of these estimates fail to define what they mean by "gene." Many scientists seem to think that any detectable transcript must come from a gene.

This doesn't make any sense since we know that spurious transcripts exist and they don't come from genes by any meaningful definition of gene. The only reasonable definition of a molecular gene is a DNA sequence that's transcribed to produce a functional product.1

The idea that spurious, non-functional transcripts exist has been described in the scientific literature for many decades. One of my favorite descriptions is in a paper by Ponting and Haerty (2022) quoting a 2013 paper by Ulitsky and Bartel.

The cellular transcriptional machinery does not perfectly discriminate cryptic promoters from functional gene promoters. This machinery is abundant and so can engage sites momentarily depleted of nucleosomes and rapidly initiate transcription. The chance occurrence of splice sites can then facilitate the capping, splicing, and polyadenylation of long transcripts. A very large number of such rare RNA species are detectable in RNA-sequencing experiments whose properties are virtually indistinguishable from those of bona fide lncRNAs. Consequently, “a sensible [null] hypothesis is that most of the currently annotated long (typically >200 nt) noncoding RNAs are not functional, i.e., most impart no fitness advantage, however slight” (Ulitsky and Bartel, 2013: p. 26).

The important point here is that the correct null hypothesis is that these transcripts don't have a biologically relevant function and the burden of proof is on researchers to demonstrate function before assigning them to a genuine gene. My colleagues at the University of Toronto made the same point in a paper published in 2015.

In the absence of sufficient evidence, a given ncRNA should be provisionally labeled as non-functional. Subsequently, if the ncRNA displays features/activities beyond what one would expect for the null hypothesis, then we can reclassify the ncRNA in question as being functional. (Palazzo and Lee, 2015)
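The decision rule in this quote can be sketched in a few lines of Python. This is only an illustration of the logic, not a method from Palazzo and Lee (2015): the `Transcript` class, the evidence labels, the transcript names, and the `min_evidence` threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    name: str
    # Hypothetical evidence labels, e.g. "knockout_phenotype" or "sequence_conservation"
    evidence: set = field(default_factory=set)


def classify(transcript: Transcript, min_evidence: int = 1) -> str:
    """Apply the null hypothesis: non-functional unless evidence says otherwise."""
    # Reclassify only when there is evidence beyond what the null hypothesis predicts
    if len(transcript.evidence) >= min_evidence:
        return "functional"
    # Default label in the absence of sufficient evidence
    return "provisionally non-functional"


# Two made-up transcripts: one with no evidence, one with a single line of evidence
candidates = [
    Transcript("hypothetical_lncRNA_A"),
    Transcript("hypothetical_lncRNA_B", {"knockout_phenotype"}),
]
labels = {t.name: classify(t) for t in candidates}
print(labels)
```

The point of the sketch is the direction of the default: a detectable transcript starts out as provisionally non-functional, and the burden of proof lies with whoever wants to move it into the functional category.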

There are a number of well-defined lncRNAs that have been shown to have distinct reproducible functions. The key question is how many of these biologically relevant lncRNA genes exist in the human genome. I struggled with the answer to this question when I was writing my book. I finally decided to make a generous estimate of 5000 non-coding genes and that implies several thousand lncRNA genes (p. 127). I now think that estimate was far too generous and there are probably fewer than 1000 genuine lncRNA genes.

I have not scoured the literature for all the examples of human lncRNAs having good evidence of function but my impression is that there are only a few hundred. This post was incited by a recent publication by researchers from the Hospital for Sick Children and the University of Toronto (Toronto, Canada) who characterized another functional lncRNA called CISTR-ACT that plays a role in regulating cell size (Kiriakopulos et al., 2025).

I was prompted to revisit this controversy by the accompanying press release that said ...

Unlike genes that encode for proteins, CISTR-ACT is a long non-coding RNA (or lncRNA) and is part of the non-coding genome, the largely unexplored part that makes up 98 per cent of our DNA. This research helps show that the non-coding genome, often dismissed as ‘junk DNA’, plays an important role in how cells function.

We're used to this kind of misinformation2 in press releases but I thought it would be a good idea to read the paper. As I expected, there's nothing in the paper about junk DNA but here's the first sentence of the introduction.

The human genome contains more long non-coding RNAs (lncRNAs) than protein-coding genes (GENCODE v49) which regulate genes and chromatin scaffolding.

GENCODE Release 49, the latest version, claims that there are 35,899 lncRNA genes. This is the only reference in the Kiriakopulos et al. paper to the number of lncRNA genes. There's no mention of the controversy and none of the papers that discuss the controversy are referenced.

The GENCODE number is close to the latest version of Ensembl, which lists 35,042 lncRNA genes. I couldn't find any good explanation for these numbers or for the definition of "gene" that they are using but what's interesting is how these numbers are climbing every year; for example, a paper from two years ago listed a number of sources and you can see that the RefSeq and GENCODE numbers are much smaller than today's numbers (Amaral et al., 2023).3

We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims.

Ponting and Haerty (2022)

It's perfectly acceptable to state your preferred view on lncRNAs when you publish a paper. The authors of the recent paper may want to believe that there are more lncRNA genes than protein-coding genes but I think it's important for them to define what they mean by "gene" when they make such a claim. What's not acceptable, in my opinion, is to ignore a genuine scientific controversy by not mentioning in the introduction that there are other legitimate views.

It's a shame that they didn't do that because their paper is a good example of the hard work that needs to be done in order to demonstrate that a particular lncRNA has a biologically relevant function.

In closing, I want to emphasize the recent review by Ponting and Haerty (2022)4 that points out the importance of the problem and the kinds of experiments that need to be done in order to establish that a given RNA comes from a real gene. This is how a scientific controversy should be addressed. Here's the abstract of that paper ...

Do long noncoding RNAs (lncRNAs) contribute little or substantively to human biology? To address how lncRNA loci and their transcripts, structures, interactions, and functions contribute to human traits and disease, we adopt a genome-wide perspective. We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims. We discuss pitfalls of lncRNA experimental and computational methods as well as opposing interpretations of their results. The majority of evidence, we argue, indicates that most lncRNA transcript models reflect transcriptional noise or provide minor regulatory roles, leaving relatively few human lncRNAs that contribute centrally to human development, physiology, or behavior. These important few tend to be spliced and better conserved but lack a simple syntax relating sequence to structure and mechanism, and so resist simple categorization. This genome-wide view should help investigators prioritize individual lncRNAs based on their likely contribution to human biology.


1. See Wikipedia: Gene; What Is a Gene?; Definition of a gene (again); Must a Gene Have a Function?.

2. No knowledgeable scientist ever said that all non-coding DNA was junk. We've known about non-coding genes for more than half-a-century.

3. See How many genes in the human genome (2023)?

4. See Most lncRNAs are junk

Amaral, P., Carbonell-Sala, S., De La Vega, F.M., Faial, T., Frankish, A., Gingeras, T., Guigo, R., Harrow, J.L., Hatzigeorgiou, A.G., Johnson, R. et al. (2023) The status of the human gene catalogue. Nature 622:41-47. [doi: 10.1038/s41586-023-06490-x]

Kiriakopulos et al. (2025) LncRNA CISTR-ACT regulates cell size in human and mouse by guiding FOSL2. Nature communications: (in press). [doi: 10.1038/s41467-025-67591-x]

Palazzo, A.F. and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in genetics 6:2(1-11). [doi: 10.3389/fgene.2015.00002]

Ponting, C.P. and Haerty, W. (2022) Genome-Wide Analysis of Human Long Noncoding RNAs: A Provocative Review. Annual review of genomics and human genetics 23. [doi: 10.1146/annurev-genom-112921-123710]

Ulitsky, I. and Bartel, D.P. (2013) lincRNAs: genomics, evolution, and mechanisms. Cell 154:26-46. [doi: 10.1016/j.cell.2013.06.020]

Thursday, December 11, 2025

How many regulatory sites in the human genome?

The current best model of the human genome is that only 10% is functional and 90% is junk. This model was first developed over half a century ago (see Junk DNA). From the very beginning, the model recognized that regulatory sequences would make up a significant proportion of the functional elements but early suggestions that most of the repetitive DNA would turn out to be involved in regulation were rejected.

As more and more data accumulated on regulatory sequences, it became apparent that most regulatory sequences of pol II (RNA polymerase II) genes could be found in relatively short regions of DNA just upstream of the transcription start site. It also became apparent that for each transcription factor there were thousands of transcription factor binding sites even though only a small number were actually involved in genuine gene regulation.1

Saturday, April 12, 2025

Templeton Foundation funds a grant on transposons

The John Templeton Foundation aims to support "interdisciplinary research and catalyze conversations that enable people to pursue lives of meaning and purpose." Many of these projects have religious themes or religious implications. The foundation is well-known for its support of projects that promote the compatibility of science and religion. You can see a list of recent grants here.

Templeton recently awarded a grant of $607,686 (US) to study the role of transposons in the human genome. The project leader is Stefan Linquist, a philosopher from the University of Guelph (Guelph, Ontario, Canada). Stefan has published a number of papers on junk DNA and he promotes the definition of functional DNA as DNA that is subject to purifying selection [The function wars are over]. Other members of the team include Ryan Gregory and Ford Doolittle who are prominent supporters of junk DNA.

Friday, March 21, 2025

The misinformation spread by ENCODE in 2012 is gradually being recognized

I want to draw your attention to an excellent online book on bacterial genomes: Bacterial Genomes: Trees and Networks. The author is Aswin Sai Narain Seshasayee of the National Centre for Biological Sciences at the Tata Institute of Fundamental Research in Bangalore, India. Here's a link to Chapter 3: The genome: how much DNA? where he explains why bacterial genomes don't have very much junk DNA.

The chapter contains an excellent summary of the history of genome sizes in bacteria and eukaryotes and a detailed description of both the c-value paradox and the mutation load arguments. The relationship between junk DNA and population size is described.

I was especially pleased to see that the author didn't pull any punches in describing the ENCODE publicity campaign and their false statements about junk DNA.

In 2012, a post-human-genome project called ENCODE, which aims to experimentally identify regions of the human genome that undergo transcription—or are bound by a set of DNA-binding proteins, or undergo chemical changes called epigenetic modifications—came to a stunning conclusion that at least 80% of the human genome is functional and that it was time to sing a requiem for the concept of junk DNA! However, this conclusion, which has been severely criticised since its publication, ignores decades of well-supported arguments from evolutionary biology arising from the c-value paradox, some of which we have described here or will do so shortly; it does not quite explain why this conclusion—if broadly applied to the genomes of other multicellular eukaryotes—would not imply that a fish needs 100 times as much functional DNA as a human; and plays “fast and loose” with the definition of the term ‘function’. While the ENCODE project, a great success in many ways, has provided an invaluable resource for the study of human molecular biology, we can safely ignore its ill-fated conclusion on what fraction of the human genome is functional.


Saturday, February 15, 2025

Junk DNA is gradually making its way into mainstream textbooks

The idea that most of the human genome is junk originated more than 50 years ago. Since then, evidence in support of this concept has steadily accumulated but it has been strongly resisted by most biochemists and molecular biologists. Opposition is even stronger among scientists in other fields and in the general public thanks to a steady stream of anti-junk articles in the popular press.

Much of this opposition to junk DNA stems from a massive publicity campaign launched by ENCODE researchers and the leading science journals back in 2012.

It's likely that most of the controversy over junk DNA is related to differing views on evolution and the power of natural selection. Most people think that natural selection is very powerful so that modern species must be extremely well-adapted to their present environment. They tend to believe that complexity is simply a reflection of sophisticated fine-tuning and this must apply to the human genome. According to this view, the presence of huge amounts of DNA with an unknown function is just a temporary situation and in the next few years most of this 'dark matter' will turn out to have a function. It has to have a function otherwise natural selection would have eliminated it.

Wednesday, February 05, 2025

Why Trust Science?

Bruce Alberts,1 Karen Hopkin, and Keith Roberts have published an essay on Why Trust Science.

In this essay, we address the question of why we can trust science—and how we can identify which scientific claims we can trust. We begin by explaining how scientists work together, as part of a larger scientific community, to generate knowledge that is reliable. We describe how the scientific process builds a consensus, and how new evidence can change the ways that scientists—and, ultimately, the rest of us—see the world. Last, but not least, we explain how, as informed citizens, we can all become “competent outsiders” who are equipped to evaluate scientific claims and are able to separate science facts from science fiction.

Most of the essay describes an idealized version of how science works with an emphasis on collaboration and rigorous oversight. They claim that the work of scientists can usually be trusted because it is self-correcting.

Thursday, January 16, 2025

Intelligent Design Creationists launch a new attack on junk DNA (are they getting worried?)

The Center for Science and Culture (sic) and the Discovery Institute (sic) have published another propaganda video on junk DNA. The emphasis is on their claim that ID predicted a functional genome and that prediction turned out to be correct! The difference between this video and previous attempts to rationalize their failures is that I now get a personal mention and a caricature in this latest video.

I think I understand the problem. The ID creationists are getting worried about junk DNA as they realize that more and more scientists are beginning to understand the real problems with the ENCODE data and previous claims of function. This is why they are attempting to rebut the science behind junk DNA. But the real problem is that they simply don't understand the science as you can see in the video.

Once again, we are faced with a question about whether Intelligent Design Creationists are stupid or lying (or both).


Thursday, November 14, 2024

Science journal tries to understand misinformation

The November 1, 2024 issue of Science contains three articles on misinformation in science. The articles tend to concentrate on the standard examples such as vaccine misinformation but there's another kind of misinformation that's just as important. I'm talking about scientific misinformation that's spread by journals like Science and Nature.

Do any of you remember the arsenic affair? That's when Science accepted a paper by Felisa Wolfe-Simon and her collaborators claiming that they isolated a bacterium that substituted arsenic for phosphorus in its DNA. The paper was published online and was severely criticized after a ridiculous NASA press conference. It was eventually refuted when Rosie Redfield and others looked closely at the bacterial DNA and showed that it did not contain arsenic. The paper has still not been retracted. [See Reviewing the "Arseniclife" Paper.]

And let's not forget the massive misinformation campaign associated with the publication of ENCODE results in 2012.

Sunday, November 10, 2024

Do plants have junk DNA?

Current Opinion in Plant Biology has a special edition devoted to Genome studies and molecular genetics 2024. The only paper (so far) that discusses plant genomes is one devoted to RNAs. Here's the abstract ...

Anyatama, A., Datta, T., Dwivedi, S. and Trivedi, P.K. (2024) Transcriptional junk: Waste or a key regulator in diverse biological processes? Current Opinion in Plant Biology 82:102639. [doi: 10.1016/j.pbi.2024.102639]

Plant genomes, through their evolutionary journey, have developed a complex composition that includes not only protein-coding sequences but also a significant amount of non-coding DNA, repetitive sequences, and transposable elements, traditionally labeled as “junk DNA”. RNA molecules from these regions, labeled as “transcriptional junk,” include non-coding RNAs, alternatively spliced transcripts, untranslated regions (UTRs), and short open reading frames (sORFs). However, recent research shows that this genetic material plays crucial roles in gene regulation, affecting plant growth, development, hormonal balance, and responses to stresses. Additionally, some of these regulatory regions encode small proteins, such as miRNA-encoded peptides (miPEPs) and microProteins (miPs), which interact with DNA or nuclear proteins, leading to chromatin remodeling and modulation of gene expression. This review aims to consolidate our understanding of the diverse roles that these so-called “transcriptional junk” regions play in regulating various physiological processes in plants.

Thursday, October 31, 2024

Philip Ball's view of alternative splicing

Genomics is a powerful tool that allows you to collect massive amounts of data that can point the way to new understanding. But it can also be abused when the results are overinterpreted. We saw an extraordinary example of this in 2012 when ENCODE made unsubstantiated claims that were quickly challenged.

I'm reminded of the caution from Sydney Brenner who warned us about genomics (Brenner, 2000) and the warning in Dan Graur's harsh critique of the 2012 ENCODE claims (Graur et al., 2013) where they said ...

The Editor-in-Chief of Science, [Bruce Alberts,] has recently expressed concern about the future of "small science," given that ENCODE-style Big Science grabs the headlines that decision makers so dearly love. Actually the main function of Big Science is to generate massive amounts of easily accessible data. The road from data to wisdom is quite long and convoluted. Insight, understanding, and scientific progress are generally achieved by "small science." ...