
Showing posts with label Genome. Show all posts

Thursday, January 15, 2026

Even more regulatory elements?

The expression of genes is regulated at many levels but one of the most important is regulation at the level of transcription. Transcription initiation is controlled by transcription factors that bind to sequences near the promoter and either activate or repress transcription.

A lot of work has been done on transcription regulation in mammals over the past 40 years. The general impression from these detailed studies of individual genes is that regulation usually involves a relatively small number of transcription factors that bind to sequences within 1000 bp or so of the transcription start site.

This model was challenged by the ENCODE studies in 2012. ENCODE researchers claimed to have discovered hundreds of thousands of cis-regulatory elements (CREs) covering a substantial percentage of the genome. If they are correct, then this means that there are dozens of transcription factors controlling the expression of every gene.

Sunday, January 04, 2026

Will AlphaGenome from Google DeepMind help us understand the human genome?

I recently reported that Google's AI program does a horrible job of summarizing the junk DNA controversy. [The scary future of AI is revealed by how it deals with junk DNA] That led to a discussion about the "intelligence" in artificial intelligence and whether AI was capable of distinguishing between accurate and inaccurate data.

Google DeepMind is an artificial intelligence research laboratory headquartered in London, UK. Two of its programmers, Demis Hassabis and John Jumper, were awarded the 2024 Nobel Prize in Chemistry for developing AlphaFold, a program that predicts the tertiary structure of proteins.

Friday, December 19, 2025

How many lncRNA genes in the human genome? (2025)

There is considerable controversy over the total number of genes in the human genome. The number of protein-coding genes is pretty well established at somewhere between 19,500 and 20,000. It's the number of non-coding genes that's disputed.

There's general agreement on the number of well-defined small RNA genes such as snRNAs, snoRNAs, microRNAs, etc. Similarly, the number of ribosomal RNA and tRNA genes is known. The problem is with identifying genuine long non-coding RNA genes (lncRNA genes). Estimates vary from fewer than 20,000 to more than 200,000 but most of these estimates fail to define what they mean by "gene." Many scientists seem to think that any detectable transcript must come from a gene.

This doesn't make any sense since we know that spurious transcripts exist and they don't come from genes by any meaningful definition of gene. The only reasonable definition of a molecular gene is a DNA sequence that's transcribed to produce a functional product.1

The idea that spurious, non-functional transcripts exist has been described in the scientific literature for many decades. One of my favorites is in a paper by Ponting and Haerty (2022) quoting a 2013 paper by Ulitsky and Bartel.

The cellular transcriptional machinery does not perfectly discriminate cryptic promoters from functional gene promoters. This machinery is abundant and so can engage sites momentarily depleted of nucleosomes and rapidly initiate transcription. The chance occurrence of splice sites can then facilitate the capping, splicing, and polyadenylation of long transcripts. A very large number of such rare RNA species are detectable in RNA-sequencing experiments whose properties are virtually indistinguishable from those of bona fide lncRNAs. Consequently, “a sensible [null] hypothesis is that most of the currently annotated long (typically >200 nt) noncoding RNAs are not functional, i.e., most impart no fitness advantage, however slight” (Ulitsky and Bartel, 2013: p. 26).

The important point here is that the correct null hypothesis is that these transcripts don't have a biologically relevant function and the burden of proof is on researchers to demonstrate function before assigning them to a genuine gene. My colleagues at the University of Toronto made the same point in a paper published in 2015.

In the absence of sufficient evidence, a given ncRNA should be provisionally labeled as non-functional. Subsequently, if the ncRNA displays features/activities beyond what one would expect for the null hypothesis, then we can reclassify the ncRNA in question as being functional. (Palazzo and Lee, 2015)

There are a number of well-defined lncRNAs that have been shown to have distinct reproducible functions. The key question is how many of these biologically relevant lncRNA genes exist in the human genome. I struggled with the answer to this question when I was writing my book. I finally decided to make a generous estimate of 5000 non-coding genes and that implies several thousand lncRNA genes (p. 127). I now think that estimate was far too generous and there are probably fewer than 1000 genuine lncRNA genes.

I have not scoured the literature for all the examples of human lncRNAs having good evidence of function but my impression is that there are only a few hundred. This post was incited by a recent publication by researchers from the Hospital for Sick Children and the University of Toronto (Toronto, Canada) who characterized another functional lncRNA called CISTR-ACT that plays a role in regulating cell size (Kiriakopulos et al., 2025).

I was prompted to revisit this controversy by the accompanying press release that said ...

Unlike genes that encode for proteins, CISTR-ACT is a long non-coding RNA (or lncRNA) and is part of the non-coding genome, the largely unexplored part that makes up 98 per cent of our DNA. This research helps show that the non-coding genome, often dismissed as ‘junk DNA’, plays an important role in how cells function.

We're used to this kind of misinformation2 in press releases but I thought it would be a good idea to read the paper. As I expected, there's nothing in the paper about junk DNA but here's the first sentence of the introduction.

The human genome contains more long non-coding RNAs (lncRNAs) than protein-coding genes (GENCODE v49) which regulate genes and chromatin scaffolding.

The latest version of GENCODE (Release 49) claims that there are 35,899 lncRNA genes. This is the only reference in the Kiriakopulos et al. paper to the number of lncRNA genes. There's no mention of the controversy and none of the papers that discuss the controversy are referenced.

The GENCODE number is close to the latest version of Ensembl, which lists 35,042 lncRNA genes. I couldn't find any good explanation for these numbers or for the definition of "gene" that they are using but what's interesting is how these numbers are climbing every year; for example, a paper from two years ago listed a number of sources and you can see that the RefSeq and GENCODE numbers are much smaller than today's numbers (Amaral et al., 2023).3

We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims.

Ponting and Haerty (2022)

It's perfectly acceptable to state your preferred view on lncRNAs when you publish a paper. The authors of the recent paper may want to believe that there are more lncRNA genes than protein-coding genes but I think it's important for them to define what they mean by "gene" when they make such a claim. What's not acceptable, in my opinion, is to ignore a genuine scientific controversy by not mentioning in the introduction that there are other legitimate views.

It's a shame that they didn't do that because their paper is a good example of the hard work that needs to be done in order to demonstrate that a particular lncRNA has a biologically relevant function.

In closing, I want to emphasize the recent review by Ponting and Haerty (2022)4 that points out the importance of the problem and the kinds of experiments that need to be done in order to establish that a given RNA comes from a real gene. This is how a scientific controversy should be addressed. Here's the abstract of that paper ...

Do long noncoding RNAs (lncRNAs) contribute little or substantively to human biology? To address how lncRNA loci and their transcripts, structures, interactions, and functions contribute to human traits and disease, we adopt a genome-wide perspective. We intend to provoke alternative interpretation of questionable evidence and thorough inquiry into unsubstantiated claims. We discuss pitfalls of lncRNA experimental and computational methods as well as opposing interpretations of their results. The majority of evidence, we argue, indicates that most lncRNA transcript models reflect transcriptional noise or provide minor regulatory roles, leaving relatively few human lncRNAs that contribute centrally to human development, physiology, or behavior. These important few tend to be spliced and better conserved but lack a simple syntax relating sequence to structure and mechanism, and so resist simple categorization. This genome-wide view should help investigators prioritize individual lncRNAs based on their likely contribution to human biology.


1. See Wikipedia: Gene; What Is a Gene?; Definition of a gene (again); Must a Gene Have a Function?.

2. No knowledgeable scientist ever said that all non-coding DNA was junk. We've known about non-coding genes for more than half-a-century.

3. See How many genes in the human genome (2023)?

4. See Most lncRNAs are junk

Amaral, P., Carbonell-Sala, S., De La Vega, F.M., Faial, T., Frankish, A., Gingeras, T., Guigo, R., Harrow, J.L., Hatzigeorgiou, A.G., Johnson, R. et al. (2023) The status of the human gene catalogue. Nature 622:41-47. [doi: 10.1038/s41586-023-06490-x]

Kiriakopulos et al. (2025) LncRNA CISTR-ACT regulates cell size in human and mouse by guiding FOSL2. Nature communications: (in press). [doi: 10.1038/s41467-025-67591-x]

Palazzo, A.F. and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in genetics 6:2(1-11). [doi: 10.3389/fgene.2015.00002]

Ponting, C.P. and Haerty, W. (2022) Genome-Wide Analysis of Human Long Noncoding RNAs: A Provocative Review. Annual review of genomics and human genetics 23. [doi: 10.1146/annurev-genom-112921-123710]

Ulitsky, I. and Bartel, D.P. (2013) lincRNAs: genomics, evolution, and mechanisms. Cell 154:26-46. [doi: 10.1016/j.cell.2013.06.020]

Monday, November 24, 2025

Evolution explains the differences between the human and chimpanzee genomes

If you align similar regions of the human and chimpanzee genomes they turn out to be about 98.6% identical in nucleotide sequence. The total number of differences amounts to 44 million base pairs (bp). If the differences are due to mutations that have occurred since divergence from a common ancestor, then there would be 22 million mutations in each lineage.

The mutation rate is approximately 100 new mutations per generation. Most of these will be neutral mutations that have no effect on the survival of the individual and almost all of them will be lost within a few generations. A small number of these neutral mutations will become fixed in the population and it's these fixed mutations that produce most of the changes in the genome of evolving populations. According to the neutral theory of population genetics, the number of fixed neutral mutations corresponds to the mutation rate. Thus, in every evolving population there will be 100 new fixed mutations per generation.
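The arithmetic in the two paragraphs above can be checked with a short Python sketch. The three input numbers come straight from the text; the function names are made up for illustration, and the "implied aligned length" is just a derived sanity check, not a figure quoted in the post.

```python
# Back-of-the-envelope check of the numbers quoted in the text.
IDENTITY = 0.986              # fraction of aligned sites that are identical
TOTAL_DIFFERENCES = 44e6      # differences between the two genomes (bp)
FIXED_PER_GENERATION = 100    # neutral fixations per generation (= mutation rate)


def per_lineage(total_differences: float) -> float:
    """Split the differences evenly between the two lineages."""
    return total_differences / 2


def generations_needed(fixed_mutations: float, rate: float) -> float:
    """Generations required to fix that many mutations at a constant rate."""
    return fixed_mutations / rate


def implied_aligned_length(total_differences: float, identity: float) -> float:
    """Aligned length implied by the difference count and the percent identity."""
    return total_differences / (1 - identity)


if __name__ == "__main__":
    each = per_lineage(TOTAL_DIFFERENCES)
    print(f"{each:,.0f} mutations per lineage")
    print(f"{generations_needed(each, FIXED_PER_GENERATION):,.0f} generations")
    aligned = implied_aligned_length(TOTAL_DIFFERENCES, IDENTITY)
    print(f"{aligned / 1e9:.2f} billion bp aligned")
```

With these inputs each lineage needs about 220,000 generations to accumulate its 22 million fixed mutations. Converting that to years requires a generation time, which this excerpt does not give.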

Thursday, September 25, 2025

Wednesday talk at the University of Toronto: Larry Moran on "What's in Your Genome"

I'm giving a talk next Wednesday (October 1st) to the members of the Senior College (retired faculty). It's at the University of Toronto Faculty Club at 10am. I'll talk for 50 mins then there's a coffee break followed by 50 mins of questions and discussion.

Guests are welcome but you'll have to pay $10 to cover the cost of coffee and cookies. You can also register to watch my talk on Zoom. If you'd like to stay for lunch at the Faculty Club, you'll have to let me know so I can put you down as a guest.

Here's the link to register: What's in Your Genome?

 

Wednesday Talk: Wednesday, October 1, 2025, 10am-12pm.

In-person at the Faculty Club and on Zoom

Larry Moran, Biochemistry, University of Toronto

Title: “What’s in Your Genome?”

Abstract: Scientists have been studying the human genome for more than 70 years but today there is considerable controversy about what’s in our genome. The publication of the complete sequence of the human genome in 2001 did nothing to resolve the controversy. For many scientists, the data confirmed their predictions that we have about 30,000 genes and most of our genome is useless junk DNA. Other scientists were shocked to learn that we have so few genes so they began the search for other explanations. Today, the majority of molecular biologists and biochemists believe that most of our genome is functional and there may be as many as 100,000 extra genes that weren’t identified in 2001. The majority of experts in molecular evolution disagree; they believe that 90% of our genome is junk DNA. I will summarize the data from both sides of the controversy and discuss the role that science journalism has played in misrepresenting scientific discoveries about the human genome.


Tuesday, May 06, 2025

L'ADN poubelle: Junk DNA

This is a podcast in French on the topic of junk DNA. The moderator is Thomas C. Durand of La Tronche en Biais, a YouTube channel that focuses on critical thinking. Durand interviews two scientists from l’Université Paris Cité (City University of Paris), Didier Casane and Patrick Laurenti.

It's a two-hour video that discusses all the relevant topics on the human genome and junk DNA. The most exciting part for me comes at 56 mins when the moderator asks Casane and Laurenti to recommend a book on the subject. Patrick Laurenti suggests that my book should be translated into French but I don't think that's going to happen.


Saturday, February 15, 2025

Junk DNA is gradually making its way into mainstream textbooks

The idea that most of the human genome is junk originated more than 50 years ago. Since then, evidence in support of this concept has steadily accumulated but it has been strongly resisted by most biochemists and molecular biologists. Opposition is even stronger among scientists in other fields and in the general public thanks to a steady stream of anti-junk articles in the popular press.

Much of this opposition to junk DNA stems from a massive publicity campaign launched by ENCODE researchers and the leading science journals back in 2012.

It's likely that most of the controversy over junk DNA is related to differing views on evolution and the power of natural selection. Most people think that natural selection is very powerful so that modern species must be extremely well-adapted to their present environment. They tend to believe that complexity is simply a reflection of sophisticated fine-tuning and this must apply to the human genome. According to this view, the presence of huge amounts of DNA with an unknown function is just a temporary situation and in the next few years most of this 'dark matter' will turn out to have a function. It has to have a function otherwise natural selection would have eliminated it.

Saturday, October 26, 2024

Three lungfish species have huge genomes

Lungfish are our closest living fish cousins. All living terrestrial vertebrates (e.g. amphibians, mammals, reptiles) descend from a common ancestor with lungfish. The split occurred about 400 million years ago (400 Ma), in the Devonian, when there were 70-100 different lungfish species.

This relationship (lungfish-tetrapods) was firmly established recently by comparing the genome of the Australian lungfish (Neoceratodus forsteri) with that of tetrapods (Meyer et al., 2021). The other possibility had been coelacanth-tetrapods. Coelacanths and lungfish are related; they form the class Sarcopterygii (lobe-finned fish).

Monday, October 21, 2024

Philip Ball strikes back

Philip Ball believes that we are in the middle of a revolution in our way of thinking about how life works. His ideas are complex but part of his case involves molecular biology and how things work at the molecular level. Ball believes that the old view of molecular biology placed far too much emphasis on coding DNA and ignored all the other functional regions of genomes. He also says that most of our genes specify non-coding RNA instead of mRNA and implies to his readers that a very large fraction of our genome is functional (i.e. not junk).1

In order to build the case for revolution, he tries to demonstrate a paradigm shift in our view of molecular biology by showing a huge gap between the understanding of previous generations of molecular biologists and the post-genomic view. I believe he is wrong about this for two reasons: first, he misrepresents the views of older molecular biologists and, second, he misrepresents the discoveries of the past twenty years. I tried to explain why he was wrong about these two claims in a previous post where I discussed an article he published in Scientific American in May 2024: Philip Ball says RNA may rule our genome.

Philip Ball responded to my criticism in a comment under that article.

Older molecular biologists were really stupid

I said ...

Ball begins with the same old myth that writers like him have been repeating for many years. He claims that before ENCODE most molecular biologists were really stupid. According to Philip Ball, most of us thought that coding DNA was the only functional part of the genome and most of the rest was junk DNA.

In the comment section of my earlier post, Philip Ball says,

I’m sorry to say that Larry’s commentary here is dismayingly inaccurate.

Let’s get this one out of the way first:

“He claims that before ENCODE most molecular biologists were really stupid.”

I have never made this claim and never would – it is a pure fabrication on Larry’s part. I guess this is what John Horgan meant in his comment to Larry: credible writers don’t just make up stuff.

I admit that Philip Ball never said those exact words. I'll leave it to the readers to decide whether my characterization of his position is accurate.

I stand by the statements I made although I admit to a bit of hyperbole. Ball has said repeatedly that the molecular biologists of my generation were wedded to the idea that coding regions were the only important part of the genome and he often connects that to the Central Dogma of Molecular Biology. He also claims that the experts in molecular biology dismissed all non-coding DNA as junk. Here's how he puts it in another article that he published recently in Aeon: We are not machines.

Only around 1-2 per cent of the entire human genome actually consists of protein-coding genes. The remainder was long thought to be mostly junk: meaningless sequences accumulated over the course of evolution. But at least some of that non-coding genome is now known to be involved in regulating genes: altering, activating or suppressing their transcription in RNA and translation into proteins.

I interpret that to mean that older molecular biologists, like me, didn't know about functional non-coding DNAs such as centromeres, telomeres, origins of replication, non-coding genes, SARs, and regulatory sequences in spite of the fact that thousands of papers on these sequences were published in the 30 years that preceded the publication of the first draft of the human genome sequence. This is not true; we did know about those things. I don't think it's too much of an exaggeration to say that Philip Ball thinks we were really stupid.

Here's what he says in his book, "How Life Works" (p. 85) when he's talking about the beginning of the human genome project.

Even at its outset, it faced the somewhat troubling issue that just 2 percent or so of our genome actually accounts for protein-coding genes. The conventional narrative was that our biology was all about proteins, for each of which the genome held the template. ... But we had all this other DNA too! What was it for? The common view was that it was mostly just junk, like the stuff in our attics: meaningless material accumulated during evolution, which our cells had no motivation to clear out.

Again, his claim is that in 1990 at the beginning of the human genome project the experts in molecular biology thought that non-coding DNA was mostly junk (98% of the genome). I have repeatedly refuted this myth and challenged anyone to come up with a single scientific paper arguing that all non-coding DNA is junk. I challenge Philip Ball to find a single molecular biology textbook written before 1990 that fails to discuss regulation, non-coding genes, and other non-coding functional elements in the human genome.

The truth is that the molecular biology experts concluded in the 1970s that we had about 30,000 genes and that 90% of our genome is junk and 10% is functional. That 10% consisted of about 2% coding DNA (now thought to be only 1%) and 8% functional non-coding DNA. So the "conventional narrative" was that there was a lot more functional non-coding DNA than coding DNA.

The human genome is full of genes for regulatory RNAs.

"Ball is one of the most meticulous, precise science writers out there. He is the antithesis of hypey, "dumb-it-down" reporting. He is MUCH more credible than you are, Laurence."

John Horgan, July 2024

The title of the article I was discussing is "Revolutionary Genetics Research Shows RNA May Rule Our Genome." In that article Ball says that ENCODE was basically right and there are many more non-coding genes than protein-coding genes. I pointed out that Ball mentions some criticism of this idea but only to dismiss it. I said that "[Ball] wants you to believe that almost of all of those transcripts are functional—that's the revolution that he's promoting." Philip Ball objects to this statement ...

This too is sheer fabrication. I don’t say this in my article, nor in my book. Instead, I say pretty much what Larry seems to want me to say, but for some reason he will not admit it – which is that there is controversy about how many of the transcripts are functional.

Ball states that "ENCODE was basically right" when they claimed that 75% of our genome was transcribed and he goes on to say that ...

Dozens of other research groups, scoping out activity along the human genome, also have found that much of our DNA is churning out 'noncoding' RNA.

He says that ENCODE has identified 37,000 noncoding genes but there may be as many as 96,000. After making these definitive statements, he mentions that there are "still doubters" but then discusses why these discoveries are revolutionary. Later on he quotes John Mattick suspecting that there may be more than 500,000 non-coding genes.

Toward the end of the article, after discussing all kinds of functional RNAs, he brings up the Ponting and Haerty review where they say that most lncRNAs are just noise. He also mentions that the low copy number of non-coding RNAs raises questions about whether they are functional but immediately counters with the standard excuses from his allies.

Ball closes the article with ...

Gingeras says he is perplexed by ongoing claims that ncRNAs are merely noise or junk, as evidence is mounting that they do many things. "It is puzzling why there is such an effort to persuade colleagues to move from a sense of interest and curiosity in the ncRNA field to a more dubious and critical one," he says.

Perhaps the arguments are so intense because they undercut the way we think our biology works. Ever since the epochal discovery about DNA's double helix and how it encodes information, the bedrock idea of molecular biology has been that there are precisely encoded instructions that program specific molecules for particular tasks. But ncRNAs seem to point to a fuzzier, more collective, logic to life. It is a logic that is harder to discern and harder to understand. But if scientists can learn to live with the fuzziness, this view of life may turn out to be more complete.

What's remarkable about the quote from a leading ENCODE worker (Gingeras) is that he is "puzzled" by scientists who are dubious and critical about claims in the ncRNA field. Isn't that what good scientists are supposed to do? Isn't that exactly what we did when we successfully challenged the dubious claims about junk DNA made in 2012?

There is no doubt in my mind that Philip Ball has fallen hook, line, and sinker for the ENCODE claims that our genome is buzzing with non-coding genes. He only brings up the counter-arguments to dismiss them and pretend that he is being fair. Nobody who was truly skeptical about the function of transcripts would write an article with the title, "Revolutionary Genetics Research Shows RNA May Rule Our Genome."

However, as Ball points out in other comments, he does have a sentence in his book where he mentions that perhaps only 30% of the genome is functional. He says in the comment that he believes the amount of functional DNA lies somewhere between 10% and 30%. That's not something he mentions in the Scientific American article. If he's being honest, it does mean that I was unfair when I said he believes that "almost of all of those transcripts are functional," but I only know that from what he now says, not from the published article.

If I were to take Philip Ball at his word—as expressed in the comment—then he must believe that most of the ENCODE transcripts are junk RNA. That's not a belief that you get from reading his published work.2 Furthermore, if I were to take him at his word, then he must believe that there are some reasonable criteria that must be applied to a transcript in order to decide whether it has a biologically relevant function. So, when he says that ENCODE identified 37,600 non-coding genes he must have these criteria in mind but he doesn't express any serious skepticism about that number. We all know that there's no solid evidence that such a large number of transcripts are functional but that doesn't bother Philip Ball. He thinks we are in the middle of an RNA revolution.


1. In commenting on my previous post, Ball says he believes that somewhere between 70% and 90% of our genome is junk but he doesn't say this in the Scientific American article. Instead, he says that scientists were surprised to learn that 75% of the human genome is transcribed, implying that there's a lot of function. He goes on to say that "ENCODE was basically right." But what the ENCODE publicity campaign actually said was that junk DNA is dead and there's practically no junk DNA. If Ball really believes that up to 90% of the genome is junk then to me this means that ENCODE was spectacularly wrong, not "basically right."

2. Ball says that 75% of the genome is transcribed. If Ball believes that as little as 10% may be functional then he must believe that less than 10% is transcribed to produce functional RNAs since he has to allow for regulatory sequences and other functional DNA elements. Let's say that 8% is a reasonable number. Ball seems to be willing to admit that 67% of the genome might be transcribed to produce junk RNA.
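The subtraction in footnote 2 can be written out as a small Python sketch. The 75% and 8% figures are the ones used in the footnote; the function names are hypothetical, and the second function adds a derived ratio that the footnote does not state.

```python
def junk_transcribed_pct(transcribed_pct: float, functional_rna_pct: float) -> float:
    """Percent of the genome that is transcribed but yields no functional RNA."""
    if functional_rna_pct > transcribed_pct:
        raise ValueError("functional RNA cannot exceed the transcribed fraction")
    return transcribed_pct - functional_rna_pct


def junk_share_of_transcribed(transcribed_pct: float, functional_rna_pct: float) -> float:
    """Fraction of the transcribed DNA that produces junk RNA (derived, not in the footnote)."""
    return junk_transcribed_pct(transcribed_pct, functional_rna_pct) / transcribed_pct


if __name__ == "__main__":
    transcribed = 75.0      # percent of genome transcribed, per Ball
    functional_rna = 8.0    # percent producing functional RNA, the footnote's working number
    print(junk_transcribed_pct(transcribed, functional_rna))
    print(round(junk_share_of_transcribed(transcribed, functional_rna), 3))
```

On these numbers, 67% of the genome, or roughly nine-tenths of everything that is transcribed, would be producing junk RNA.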

Friday, October 11, 2024

Philip Ball says RNA may rule our genome

Philip Ball is on a roll. He has published a new book plus several articles in popular magazines and he has appeared in a bunch of podcasts and YouTube videos. The message is always the same: he claims that it's time for a revolution in biology.

Ball's ideas are complicated and I won't go into all of them in this article. Instead, I want to focus on one of his more scientific claims; namely, the claim that genomic data has overthrown the fundamental principles of molecular biology. Let's look at his recent (May 14, 2024) article in Scientific American: Revolutionary Genetics Research Shows RNA May Rule Our Genome.1

The subtitle of the article is "Scientists have recently discovered thousands of active RNA molecules that can control the human body" and that's the issue that I want to discuss here.

Monday, August 12, 2024

Zach Hancock explains junk DNA

Zach Hancock is a postdoc in ecology & evolutionary biology at the University of Michigan. He has a YouTube channel with several thousand subscribers. You might recall that he interviewed me last year when my book came out [Zach Hancock interviews me on his YouTube channel].

He has just posted a new video on junk DNA that's well worth watching. He tries to correct all the falsehoods and misinformation on junk DNA, especially those promoted by creationists.


Wednesday, June 05, 2024

Tom Cech writes about the "dark matter" of the genome

Tom Cech won a Nobel Prize for discovering one example of a catalytic RNA. He recently published an article in the New York Times extolling the virtues of RNA and non-coding genes [The Long-Overlooked Molecule That Will Define a Generation of Science]. There's a fair amount of hype in the article but the main point is quite valid—over the past fifty years we have learned about dozens of important non-coding RNAs that we didn't know about at the beginning of molecular biology [see: Non-coding RNA, Non-coding DNA].

The main issue in this field concerns the number of non-coding genes in the human genome. I cover the available data in my book and conclude that there are fewer than 1000 (p.214). Those scientists who promote the importance of RNA (e.g. Tom Cech) would like you to believe that there are many more non-coding genes; indeed, most of those scientists believe that there are more non-coding genes than coding genes (i.e. > 20,000). They rarely present evidence for such a claim beyond noting that much of our genome is transcribed.

Tom Cech is wise enough to avoid publishing an estimate of the number of non-coding genes but his bias is evident in the following paragraph from near the end of his article.

Although most scientists now agree on RNA's bright promise, we are still only beginning to unlock its potential. Consider, for instance, that some 75 percent of the human genome consists of dark matter that is copied into RNAs of unknown function. While some researchers have dismissed this dark matter as junk or noise, I expect it will be the source of even more exciting breakthroughs.

Let's dissect this to see where the bias lies. The first thing you note is the use of the term "dark matter" to make it sound like there's a lot of mysterious DNA in our genome. This is not true. We know a heck of a lot about our genome, including the fact that it's full of junk DNA. Only 10% of the genome is under purifying selection and assumed to be functional. The rest is full of introns, pseudogenes, and various classes of repetitive sequences made up mostly of degraded transposons and viruses. The entire genome has been sequenced—there's not much mystery there. I don't know why anyone refers to this as "dark matter" unless they have a hidden agenda.

The second thing you notice is the statement that 75% of the genome is transcribed at some time or another and, according to Tom Cech, these transcripts have an unknown function. That's strange since protein-coding genes take up roughly 40% of our genome and we know a great deal about coding DNA, UTRs, and introns. If you add in the known examples of non-coding genes, this accounts for an additional 2-3% of the genome.1

Almost all the rest of the transcripts come from non-conserved DNA and those transcripts are present at less than one copy per cell. As the ENCODE researchers noted in 2014, they are likely to be junk RNA resulting from spurious transcription. I'd say we know a great deal about the fraction of the genome that's transcribed and there's not much indication that it's hiding a plethora of undiscovered functional RNAs.


Photo credit: University of Colorado, Boulder.

1. In my book I make a generous estimate of 5,000 non-coding genes in order to avoid quibbling over a smaller number and in order to demonstrate that even with such an obvious over-estimate the genome is still 90% junk.

Saturday, March 23, 2024

More genomes, more variation

The "All of Us Research Program" is an American effort to sequence one million genomes. The stated goal is to study human genetic variants and link them to genetic diseases. The study is complementary to similar studies in Great Britain, Iceland, and Japan but the American team hopes to include more diversity in their study by recruiting people from different ethnic backgrounds.

All of Us published the results from almost 250,000 genome sequences in a recent issue of Nature (All of Us Research Program Investigators, 2024). They found one billion variants of which 275 million had not been seen before.

Recall that the UK study (UK Biobank) emphasized the importance of variation in determining whether a given region of DNA was functional or not. They noted that regions that were constrained (i.e. fewer variants) were likely under purifying selection whereas regions that accumulated variants were likely junk [Identifying functional DNA (and junk) by purifying selection]. Their results indicated that only about 10% of the genome was constrained and that's consistent with the view that 90% of our genome is junk. The American study did not address this issue so we don't know how it relates to the junk DNA controversy.

Note that if 90% of our genome is junk then that represents 2.8 billion base pairs and the potential for more than 8 billion variants in the human population.1 Some of these will be quite frequent in different groups just by chance but most of them will be quite rare. We'll have to wait and see how this all pans out when more genomes are sequenced. The idea of increasing the detection of unusual variants by sequencing more diverse populations is a good one but the real key is just more genome sequences.
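
One way to arrive at a number of that size is to count single-base substitutions only. The short Python sketch below takes the 2.8 billion base pair figure from the paragraph above and assumes that each position can differ from the reference by any of the three other bases. That assumption, and the function name `possible_snvs`, are illustrative and not taken from the All of Us paper; insertions and deletions would push the total higher.

```python
# Rough count of possible single-nucleotide variants in the junk fraction.
JUNK_BASE_PAIRS = 2_800_000_000  # 90% of the genome, figure taken from the text
ALTERNATIVE_BASES = 3            # assumption: any of the 3 other bases can replace the reference base


def possible_snvs(base_pairs: int, alternatives: int = ALTERNATIVE_BASES) -> int:
    """Return the number of distinct single-base differences from a reference sequence."""
    return base_pairs * alternatives


if __name__ == "__main__":
    total = possible_snvs(JUNK_BASE_PAIRS)
    # Report the total in billions, to one decimal place.
    print(f"{total / 1e9:.1f} billion possible single-base variants")
```

Under these assumptions the count comes to 8.4 billion, which is in line with "more than 8 billion."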

One of the things you can do with this data is to cluster the variants according to the self-identified ethnic group of the participants and All of Us didn't hesitate to do this. They even identified the clusters as races, proving once again that there are clear genetic differences between these groups, just as you would expect. Given the sensitive nature of this fact, you would also expect a lot of criticism on the internet and that's what happened.


1. I'm defining a "variant" as a difference from the reference genome sequence. I'm aware of the terminology issue but it's not important here. There will also be a large number of variants in the functional regions.

All of Us Research Program Investigators (2024) Genomic data in the All of Us Research Program. Nature 627:340. [doi: 10.1038/s41586-023-06957-x].

Saturday, December 16, 2023

What is the "dark matter of the genome"?

The phrase "dark matter of the genome" is used by scientists who are skeptical of junk DNA so they want to convey the impression that most of the genome consists of important DNA whose function is just waiting to be discovered. Not surprisingly, the term is often used by researchers who are looking for funding and investors to support their efforts to use the latest technology to discover this mysterious function that has eluded other scientists for over 50 years.

The term "dark matter" is often applied to the human genome but what does it mean? We get a clue from a BBC article published by David Cox last April: The mystery of the human genome's dark matter. He begins the article by saying,

Twenty years ago, an enormous scientific effort revealed that the human genome contains 20,000 protein-coding genes, but they account for just 2% of our DNA. The rest of was written off as junk – but we are now realising it has a crucial role to play.

Sunday, October 15, 2023

Only 10.7% of the human genome is conserved

The Zoonomia project aligned the genome sequences of 240 mammalian species and determined that only 10.7% of the human genome is conserved. This is consistent with the idea that about 90% of our genome is junk.

The April 28, 2023 issue of Science contains eleven papers reporting the results of a massive study comparing the genomes of 240 mammalian species. The issue also contains a couple of "Perspectives" that comment on the work.

Tuesday, October 10, 2023

How many genes in the human genome (2023)?

The latest summary of the number of genes in the human genome gets the number of protein-coding genes correct but its estimate of the number of known non-coding genes is far too high.

In order to have a meaningful discussion about molecular genes, we have to agree on the definition of a molecular gene. I support the following definition (see What Is a Gene?).

Thursday, May 11, 2023

Chapter 7: Gene Families and the Birth & Death of Genes

This chapter describes gene families in the human genome. I explain how new genes are born by gene duplication and how they die by deletion or by becoming pseudogenes. Our genome is littered with pseudogenes: how do they evolve and are they all junk? What are the consequences of whole genome duplications and what does it teach us about junk DNA? How many real ORFan genes are there and why do some people think there are more? Finally, you will learn why dachshunds have short legs and what "The Bridge on the River Kwai" has to do with the accuracy of the human genome sequence.

Click on this link to see more.

Gene Families and the Birth and Death of Genes