Saturday, May 09, 2026

Pervasive transcription = genes + noise

Most of the DNA in the human genome is transcribed at some point in development or in some cell type. This fact has been known since the late 1960s.

There are basically two types of transcripts. Functional transcripts mostly come from genes, although there might be a few exceptions (e.g. enhancer RNAs). Non-functional transcripts can be produced by pseudogenes or from virus and transposon fossils. They can also be due to transcriptional noise caused by spurious transcription.

It's surprisingly difficult to calculate the amount of the genome occupied by genes. One problem is that there are many different estimates of the number of non-coding genes. (The number of protein-coding genes is known to be close to 20,000.) Another problem is that the sizes of introns and exons aren't known with any precision and that's partly due to problems with annotation.

Nevertheless, we can make a reasonable estimate and that's what I did in chapter 6 of my book: "How Many Genes? How Many Proteins?" I took the best estimates of coding DNA, UTRs, and introns and calculated that the average size of a protein-coding gene was 61,760 bp (61.8 kb). Assuming that there are 19,500 of these genes, they should occupy 39% of the genome. Adding in a reasonable (and generous) estimate of non-coding genes brings the total to about 45% of the genome.1

Thus, 45% of the genome should be transcribed to produce functional RNAs. We don't know exactly how many pseudogenes are still transcribed and we don't know how much defective transposon and viral DNA is still transcribed, but it's safe to say that it might represent 5% of the genome. That means we can account for 50% of the genome being transcribed to produce easily detectable RNAs. That alone accounts for most of pervasive transcription.
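For anyone who wants to check the arithmetic, here's a minimal back-of-the-envelope sketch using the numbers quoted above. The only figure not stated explicitly in the post is the assumed haploid genome size of about 3.1 Gb; everything else comes from my estimates.

```python
# Back-of-the-envelope check of the percentages above.
# Assumption not stated in the post: haploid genome size of ~3.1 Gb.

GENOME_SIZE_BP = 3_100_000_000        # assumed haploid human genome size
PROTEIN_CODING_GENES = 19_500         # estimate used above
AVG_GENE_SIZE_BP = 61_760             # coding DNA + UTRs + introns, from above

coding_gene_fraction = PROTEIN_CODING_GENES * AVG_GENE_SIZE_BP / GENOME_SIZE_BP
print(f"protein-coding genes: {coding_gene_fraction:.0%}")          # ~39%

all_genes_fraction = 0.45             # after adding non-coding genes (generous estimate)
fossils_fraction = 0.05               # transcribed pseudogenes and transposon/virus fossils (rough)

accounted_for = all_genes_fraction + fossils_fraction
print(f"genes + transcribed fossils: {accounted_for:.0%}")          # ~50%

reported_pervasive = 0.75             # ENCODE-style estimate discussed below
print(f"left over for spurious transcription: {reported_pervasive - accounted_for:.0%}")  # ~25%
```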

However, many studies report that at least 75% of the genome is transcribed and the highest estimates are over 80%. The remaining transcripts are almost always present at very low concentrations and are rapidly degraded, and the DNA sequences that produce them are not conserved. This strongly suggests that they are spurious transcripts. We know that such transcripts exist and we know that random sequences of DNA will be transcribed by accident, giving rise to junk RNA.

A newly published paper has attempted to sort out the amount of DNA that produces functional RNAs and the amount that produces junk RNA (noise).

Adey, B.N., Maddock, D.J., Hermann-Le Denmat, S., Dinger, M.E., Gardner, P.P., Poole, A.M. and Ganley, A.R.D. (2026) Pervasive transcription in the human genome exceeds background noise. Genome Biology and Evolution: evag042. [doi: 10.1093/gbe/evag042]

Large genomes such as the human genome are pervasively transcribed yet encode relatively few unambiguously functional elements. This has led to debate over whether pervasive transcription is indicative of large suites of uncharacterized functional elements or is simply background noise. Here we used a deep-learning model to estimate background transcription in the human genome as a way of distinguishing between these two hypotheses. We applied the model to randomised (reversed or shuffled) versions of the human genome and found that transcription is predicted to be sparse across all randomisation methods, initiating with at least four-fold lower frequencies than in the native human genome. This relatively low level of background transcription from the human genome suggests that most transcription is not a consequence of background noise, thus it requires other explanations. We find that randomizing only interspersed repeats in human genome has little impact on predicted transcription, suggesting that transcription of mobile elements does not explain the excess transcription in the human genome. Instead, most transcriptional events may derive from functional noncoding RNA transcripts, some general requirement for extensive transcription initiation/elongation, and/or mutational biases leading to the frequent appearance of transcription initiation sites by chance.

The authors assume that 75% of the genome is transcribed and they are interested in knowing how much of this is due to large numbers of potential non-coding genes. They claim that their results are "an important step in resolving the long-standing debate over what proportion of the human genome is functional." They developed models to estimate the extent of background noise (spurious transcription) from random DNA sequences, and their model leads them to conclude that spurious transcription cannot account for most of that 75%.

This isn't news. As I've shown above, transcription of known genes accounts for about 45% of the genome or most of pervasive transcription. What's interesting is that the authors don't discuss introns and don't seem to recognize that a lot of pervasive transcription is from introns. In fact, the authors conclude that gene transcription cannot account for a large fraction of pervasive transcription.

In summary, we found that transcription initiates in in silico randomized human genomes at only about one-quarter of the level observed in the native human genome. This disparity is unlikely to be explained solely by the presence of genes in the native genome, as the number of TSS clusters in the T2T forward genome (742,037 clusters with a maximum cluster gap of 25 bp) is vastly greater than the number of genes. On the basis of these results, we conclude that the level of background transcription initiation activity in the human genome is relatively low and does not account for the majority of the transcription in the human genome.

What's going on here? We already knew that spurious transcription (background transcription) accounts for a lot of transcribed DNA, but genes account for most of the transcribed genome. In my opinion, there's nothing new in this paper that's worth publishing. Am I completely off base in my calculations or has the peer review system failed us again?


1. Keep in mind that there are hundreds of ribosomal RNA genes and they are quite large.

6 comments:

  1. "I took the best estimates of coding DNA, _URLs_, and introns and calculated that the average size of a protein-coding gene was 61,760 bp (61.8 Kb)."
    URLs should be UTRs (untranslated regions)?

  2. "the sizes of introns and exons aren't known with any precision"!!!???
    Shouldn't the sizes of exons have the highest priority instead of investigating transcriptional noise in the genome?

  3. The paper uses software called Puffin-D to predict transcription start sites (TSSs) in the human genome, the reversed genome, and other transformations and randomisations of the human genome. Puffin-D predicts over 2 million TSSs in the human genome, about a quarter as many in the reversed genome, and, for example, only 101 in a purely random one. So TSSs are very pervasive, but AFAICS, there is no information in the paper about:

    1. how often the TSSs are actually transcribed
    2. how long the transcripts are - they could stop a few bases later or be monsters like dystrophin
    3. how much the transcripts overlap

    The paper says almost nothing about the percentage of DNA that is transcribed, let alone about how much of that is functional. The stuff about ENCODE, 75%, junk DNA and dark DNA seems to be clickbait.

    Puffin-D and the related Puffin appeared in Science in 2024, and a preprint is at https://www.biorxiv.org/content/10.1101/2023.06.27.546584v1.full.pdf. This is not my area, but this looks like a good paper to me. Puffin-D is a black-box prediction engine. Puffin provides some interpretation.

    That paper says Puffin identifies the same 'key sequence patterns' in the mouse genome as in the human genome. The picture that comes to my mind is that in our ancestral genome there were something like 30,000 TSSs which have been preserved by natural selection for a very, very long time, and these have been copied all over the place. Sometimes they have been reversed. Some of the patterns are bidirectional so they'll appear in the reversed genome anyway. Does that make sense? What have I missed? And how many TSSs do onions have?



  4. @Graham Jones: Here are the opening sentences in the introduction.

    "A major question in genome biology is the extent to which large genomes like the human genome are made up of nonfunctional ‘junk’ DNA versus as-yet uncharacterized functional ‘dark’ DNA . This question has generated vigorous debate, and a key point of contention is how to interpret observations of pervasive transcription (Eddy 2013; Graur, et al. 2013; Kellis, et al. 2014; Doolittle and Brunet 2017; Jandura and Krause 2017; Walter 2024). Pervasive transcription refers to the widespread transcription of genomes including the regions that do not harbor any known functional elements (Kapranov, et al. 2007). For example, the ENCODE project reported that at least 75% of the human genome is transcribed (Djebali, et al. 2012), even though only ~2% of the genome encodes proteins (Piovesan, et al. 2019)."

    Note that pervasive transcription is defined as "widespread transcription of genomes" (up to 75%). It does NOT refer to the number of transcription start sites (TSS).

    You are correct to note that the authors do not actually discuss the total amount of DNA that is transcribed, in spite of the fact that they introduce this topic in the introduction to their paper and that the title of the paper is "Pervasive transcription in the human genome exceeds background noise."

    That seems disingenuous to me.

    You are also correct to note that the authors identify more than 2 million transcription start sites. Given that there are probably no more than 25,000 genes, don't you think this merits some kind of comment?

    The most obvious explanation is that the vast majority of those TSSs are not biologically relevant, don't you think? But how does that explanation jibe with their conclusion that most transcription is NOT due to background noise?

    This is not a good paper. Their Puffin-D results lead them to a conclusion that doesn't make any sense and they fail to address the obvious problems with their data. They propose two kinds of solution in the Discussion; the first is some sort of adaptive explanation of all that transcription and the second is something about mutation bias.

    Here's a sentence from their conclusion; tell me if you agree with it.

    "On the basis of these results, we conclude that the level of background transcription initiation activity in the human genome is relatively low and does not account for the majority of the transcription in the human genome."

    Do you agree that of the 2 million transcription initiation sites, the majority are NOT due to background noise?

  5. I'm a bit puzzled by your response. I more or less agree with your assessment of the Adey et al paper, but you seem to think I don't.

    I suspect the main issue is that my area is phylogenetic analysis. For example, you regard neutrally evolving sequences as *noise*, because they don't contribute to fitness, but those sequences provide the cleanest phylogenetic *signal*. I'm interested in different things to you.

    It may also help to clarify that my last two paragraphs were both about the Dudnyk paper, not the Adey one.

    You quoted from the introduction to the Adey paper. That is exactly what I was referring to when I said "The stuff about ENCODE, 75%, junk DNA and dark DNA seems to be clickbait." You say 'disingenuous', I say 'clickbait'. It's not worth arguing about that difference. We're agreed it's not a good paper.

    You asked "Do you agree that of the 2 million transcription initiation sites, the majority are NOT due to background noise?"

    I think that only a few percent of the 2 million TSSs are subject to purifying selection. I also think their presence deserves explanation. They are a signal of some (mainly neutral) evolutionary processes and I'd like to know what those are.
