Wednesday, November 03, 2021

What's in your genome?: 2021

This is an updated version of what's in your genome based on the latest data. The simple version is ...

about 90% of your genome is junk

The more sophisticated version is...

There are several ways of estimating the amount of functional DNA and the amount of junk DNA. All of them are approximations but they only differ by a few percent. Note that several categories overlap. For example, introns and pseudogenes contain substantial amounts of DNA derived from transposons. The total amount of transposon-related sequence is about 55% when you include this fraction.

Here's a list of DNA sequences that are known or presumed to have a function (i.e. they are not junk).

  • functional parts of protein-coding genes (mostly coding regions): 0.9%
  • functional parts of genes for likely noncoding RNAs: 0.6%
  • regulatory sequences: 0.2%
  • scaffold attachment regions (SARs): 0.3%
  • origins of replication: 0.3%
  • centromeres: 1%
  • telomeres: 0.1%
  • (functional virus sequences: 0.1%)
  • (functional transposons: 0.1%)
  • conserved sequences of unknown function: ~4.6% (maximum)

This adds up to about 8% of the genome. Note that there's considerable debate over the definition of function and how it applies to virus sequences and transposon sequences that are still intact. Such sequences qualify as junk DNA by my definition (DNA that can be deleted without affecting the survival of the organism), but I want to make it clear that all of the virus- and transposon-related sequences included in junk below are not intact and are thus clearly junk by any definition.

Here's a list of DNA sequences that are known or presumed to be junk DNA.

  • pseudogenes: 5%
  • introns (including 25% of transposon sequences): 43%
  • additional defective transposon sequences: 30%
  • defective virus sequences: 9%
  • mitochondrial DNA: 0.01%
  • extra repetitive DNA: 2%

The total amount of known or presumed junk DNA adds up to 89%. That leaves another 3% unaccounted for. Some of it could be nonconserved spacer DNA that's functional or it could be additional conserved sequences since the total amount of conserved DNA could be closer to 10% according to some studies. Or it could be junk DNA.
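The totals quoted above can be checked with a few lines of arithmetic. This is only a minimal sketch: the percentages are copied from the two lists in the post, the dictionary names are illustrative, and the categories are treated as non-overlapping even though the post notes that several of them overlap.

```python
# Percent of the genome per category, copied from the post's two lists.
# Assumption: categories are summed as if they did not overlap.
functional = {
    "functional parts of protein-coding genes": 0.9,
    "functional parts of noncoding RNA genes": 0.6,
    "regulatory sequences": 0.2,
    "scaffold attachment regions": 0.3,
    "origins of replication": 0.3,
    "centromeres": 1.0,
    "telomeres": 0.1,
    "functional virus sequences": 0.1,
    "functional transposons": 0.1,
    "conserved sequences of unknown function": 4.6,  # stated as a maximum
}
junk = {
    "pseudogenes": 5.0,
    "introns": 43.0,
    "additional defective transposon sequences": 30.0,
    "defective virus sequences": 9.0,
    "mitochondrial DNA": 0.01,
    "extra repetitive DNA": 2.0,
}

functional_total = sum(functional.values())
junk_total = sum(junk.values())
unaccounted = 100.0 - functional_total - junk_total

print(f"functional:  {functional_total:.1f}%")
print(f"junk:        {junk_total:.1f}%")
print(f"unaccounted: {unaccounted:.1f}%")
```

Rounded to whole numbers, the three results correspond to the "about 8%", "89%" and "another 3%" figures in the text.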

Note that there are about 20,000 protein-coding genes and they take up about 39% of the genome (~1% exons, ~37% introns). We don't know exactly how many noncoding genes there are but a reasonable (and generous) estimate is 5,000. These genes take up an additional 7% of the genome (~1% exons, ~6% introns). (Much of the functional part of noncoding RNA genes consists of 300 copies of ribosomal RNA genes, which account for 0.4%.) The important point is that roughly 45% of the genome is genes when we define a gene as a DNA sequence that's transcribed. A lot of this is junk within introns.
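To get a feel for what these percentages imply per gene, here is a rough sketch. The gene counts and genome fractions come from the paragraph above; the haploid genome size of about 3.1 billion base pairs is an assumption that is not stated in the post, and the variable names are illustrative.

```python
# Assumption (not from the post): approximate haploid human genome size.
GENOME_SIZE_BP = 3.1e9

# Figures taken from the paragraph above.
protein_coding_genes = 20_000
protein_coding_pct = 39.0   # percent of genome (exons + introns)
noncoding_genes = 5_000
noncoding_pct = 7.0         # percent of genome (exons + introns)

# Share of the genome occupied by genes, defined as transcribed DNA.
total_genes_pct = protein_coding_pct + noncoding_pct

# Implied average gene length, introns included.
avg_protein_coding_bp = GENOME_SIZE_BP * protein_coding_pct / 100 / protein_coding_genes
avg_noncoding_bp = GENOME_SIZE_BP * noncoding_pct / 100 / noncoding_genes

print(f"genes overall: {total_genes_pct:.0f}% of the genome")
print(f"average protein-coding gene: {avg_protein_coding_bp / 1000:.0f} kb")
print(f"average noncoding gene: {avg_noncoding_bp / 1000:.0f} kb")
```

Under these assumptions the average gene spans tens of kilobases, most of it intron, which is consistent with the statement that roughly 45% of the genome is genes.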

The figure below shows a region from the short arm of chromosome 12 (p13.31) in order to illustrate the gene density. A lot of people don't realize that almost half of our genome is genes.


35 comments:

  1. Are we defining "functional" for viruses and transposons as referring to function for the host or ability to function as a virus or transposon? What happens to the percentage if we count only the former?

    Replies
    1. You know the answer to your last question. It makes very little difference.

      I edited my post to (hopefully) satisfy your concern.

  2. Larry,
    Only 0.6% ncRNA has functions? LOL. Pure wishful thinking. Real experimental results out today prove you hopelessly wrong. It is pretty much 100%!
    https://www.sciencedirect.com/science/article/pii/S0092867421012307?dgcid=coauthor RNA promotes the formation of spatial compartments in the nucleus

    Replies
    1. The paper talks about hundreds of noncoding RNAs. Have you bothered to do a simple calculation to estimate how much of the genome those few hundred genes would take up?

      Looking forward to your apology but not holding my breath.

    2. From the paper as quoted below, this is what "hundreds" means. Among the hundreds (~650) of ncRNAs they examined for no specific reason (randomly picked), the vast majority (93%) are located in compartments. Wouldn't anyone expect the same for the unexamined ncRNAs?

      Quinodoz et al 2021: Thousands of nuclear-enriched ncRNAs are expressed in mammalian cells but only a handful have been mapped on chromatin. We mapped 650 long non-coding RNAs (lncRNAs) in mESCs and observed a striking difference in chromatin localization between these and mature mRNAs (Figures 5A, S5A, and S5B; see STAR Methods). Specifically, we found that the vast majority (93%) of lncRNAs are strongly enriched within 3D proximity of their transcriptional loci (Figures 5B–5D and S5C).

    3. I estimate that there are about 5,000 noncoding RNAs and that means 5,000 genes. A generous estimate of the exon size (functional part of the gene) gives 0.6% of the genome devoted to these genes.

      The paper you reference looks at the localization of many of these RNAs, such as snoRNAs and snRNAs. It also examines some longer transcripts with unknown function. They call these lncRNAs and they mapped 650 of them.

      We don't know how many of these long transcripts have a function but I estimated that there might be 1000 genuine lncRNAs. If we assume that the average length is 2000 bp then the genes would take up 0.06% of the genome or 1/10th the amount I assigned to all noncoding genes.

      What's the problem? Did you not bother to do these calculations before posting your comment?

      Surely you don't think that all of the transcripts in the nucleus are functional, do you? There's no evidence to support such a claim and plenty of evidence against it. Most of the nuclear transcripts are short-lived and present at very low concentrations. They are junk RNA produced by spurious transcription.
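The calculation referred to in this reply can be sketched as follows. The gene counts, average length and the 0.6% figure are taken from the reply; the haploid genome size of about 3.1 billion base pairs is an assumption that the reply does not state, and the variable names are illustrative.

```python
# Assumption (not from the comment): approximate haploid human genome size.
GENOME_SIZE_BP = 3.1e9

# Figures from the reply above.
lncrna_genes = 1_000          # estimated number of genuine lncRNA genes
avg_lncrna_length_bp = 2_000  # assumed average functional length
all_noncoding_pct = 0.6       # percent of genome assigned to all noncoding genes
all_noncoding_genes = 5_000

# Percent of the genome taken up by the lncRNA genes.
lncrna_pct = 100 * lncrna_genes * avg_lncrna_length_bp / GENOME_SIZE_BP

# How that compares with the share assigned to all noncoding genes.
ratio = lncrna_pct / all_noncoding_pct

# Average functional length implied by 0.6% spread over 5,000 genes.
implied_avg_bp = GENOME_SIZE_BP * all_noncoding_pct / 100 / all_noncoding_genes

print(f"lncRNA genes: {lncrna_pct:.2f}% of the genome")
print(f"ratio to all noncoding genes: {ratio:.2f}")
print(f"implied average functional length: {implied_avg_bp:.0f} bp")
```

With these inputs the lncRNA genes come to roughly 0.06% of the genome, about a tenth of the 0.6% assigned to all noncoding genes.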

    4. From the authors' preprint of a year ago, where there is no word count limit, they said more about not-yet-studied ncRNAs: "hundreds of nuclear ncRNAs are preferentially localized within precise structures in the nucleus, suggesting that this may be an important and common function exploited by additional nuclear RNAs." So, nearly all ncRNAs! NO junk DNAs. https://www.biorxiv.org/content/10.1101/2020.08.25.267435v1

    5. Larry,
      Your number crunching may be off a great deal. An ncRNA gene could be much longer than the longest coding DNA. ncRNA expression is very cell type specific, unlike mRNA. So the total number of ncRNAs expressed in a whole human body is much greater than that expressed in a single cell type. In 2018, a comprehensive integration of lncRNAs from existing databases, published literature and novel RNA assemblies based on RNA-seq data analysis revealed that there are 270,044 lncRNA transcripts in humans.
      Ma L, Cao J, Liu L, Du Q, Li Z, Zou D, Bajic VB, and Zhang Z (Jan 2019). "LncBook: a curated knowledgebase of human long non-coding RNAs". Nucleic Acids Research. 47 (Database issue): D128–D134. doi:10.1093/nar/gky960. PMC 6323930.

    6. Junk DNA or neutral DNA is largely a concept derived from theoretical reasoning, rather than from real data or experiments. That reasoning is known as the molecular clock and the neutral theory. As the neutral theory was inspired by the molecular clock and claims the clock as its best evidence, we only need to focus on the question of whether the molecular clock is really just a mirage, as some experts have claimed (F. Ayala). Our many papers in the past 10 years have shown that yes, it is indeed a mirage. We have offered in its place a more correct interpretation, the maximum genetic diversity hypothesis, which claims the exact opposite of neutral/junk DNAs. We have devised tests to distinguish these two competing theories and to see which is true; see Huang, 2012, "Primate phylogeny: molecular evidence for a pongid clade excluding humans and a prosimian clade containing tarsiers" https://link.springer.com/article/10.1007/s11427-012-4350-7

    7. Larry,
      It appears that you are claiming that only 0.6% of the human genome is transcribed as ncRNAs. That was shocking to me. Did you simply miss a major finding of ENCODE that 76% of the genome is transcribed as RNA?

      "ENCODE's results are changing how scientists think about genes. It found about 76% of the genome's DNA is transcribed into RNA of one sort or another, way more than researchers had originally expected." https://www.science.org/content/article/human-genome-much-more-just-genes#:~:text=And%20many%20bases%20are%20simply,than%20researchers%20had%20originally%20expected.

    8. I had forgotten who this guy is. Now I recall.

    9. Yeah, he's a kook who doesn't ever listen to any arguments or evidence that conflict with his peculiar worldview.

      I was hoping that things had changed but I see that they haven't.

    10. And yet he manages to get published. I took a look at the publication he cited in his last post but one. Shockingly bad. He assumes, for example, that all methods of phylogenetic analysis assume and rely on a molecular clock.

    11. Gnomon: "It appears that you are claiming that only 0.6% of the human genome is transcribed as ncRNAs."

      What rock did you just crawl out from under? You seem to have basically no familiarity with or understanding of the case for junk-DNA, much less with Larry's views.

  3. Prof. Moran,
    Is “What’s In Your Genome” a book, or just an illustration? I can’t find it on Amazon.
    Bernard Leikind

    Replies
    1. It's a book in progress. I hope to have it published next year.

  4. Replies
    1. It's not supposed to be in the DNA in your nucleus (= genome).

    2. Larry is cryptically referring to nuMTs, mitochondrial sequences that have been duplicated into the nuclear genome. (At least I think that's what he's saying.) Some species have multiple tandem (and linearized) repeats of the entire mitochondrial genome as nuMTs; I don't know the nature of human nuMTs.

    3. How much mitochondrial DNA is in your genome?

      https://sandwalk.blogspot.com/2017/11/how-much-mitochondrial-dna-in-your.html

    4. I think the answer was once ca. 580 kb. Hazkani-Covo et al (PLoS Genet 2010) report 263,478 bp (haploid, makes 537 kb diploid) and say "The number of human numts was reported with values ranging from 286 to 612 depending on the search parameters and depending on how closely related were combined hits into a single numt contig. Later calculations based on numts from both human and chimpanzee suggested an intermediate number of 452 numts [21]. Some of the human numts stem from independent insertion events from the mitochondrion, whereas others are the results of tandem duplications [19] or subsequent segmental duplications. Older numts appear in more copies than recent ones [22]." So it's about half a million bases, depending on how you count. The most recent insertions are from Chernobyl, I think. In cancer, it rains mtDNA on the nucleus, not for increasing function, I note.

    5. The number I use in my book is about 600 fragments accounting for less than 0.01% of our genome. I think this is accurate enough for my purposes.

      I appreciate that not all of these NumtS are independent insertions but that doesn't affect the calculation.

    6. Of course, the older numts are, the harder they are to detect. There's probably some kind of equilibrium with new numts appearing and old numts fading into genomic noise. We'll never know how much of the genome consists of ancient numt fragments, reaching back to the first eukaryotes.

    7. @John

      Yes that's true. It's also true of transposons, viral DNA, segmental duplications, and polyploidy. It takes at least 100 million years for the sequences to drift apart so that they may no longer be recognizable. This is presumably the major source of unique sequence junk DNA.

      You can read all about it in my book, if it ever gets published.

    8. Looking forward to the book, and I can't see why it wouldn't be published.

      I don't know about other taxa, but in birds, the horizon for neutrally evolving sequences no longer being alignable seems to exceed 100 Ma.

  5. Larry, what do you consider junk DNA?

    Replies
    1. Junk DNA is any stretch of DNA that can be eliminated without affecting the fitness of the individual. Functional DNA is any stretch of DNA that's currently being preserved by purifying selection.

      (There are minor objections that can be raised for both definitions but you get the drift.)

    2. So active transposons are (at least most of them) junk by that definition. One wonders, however, how this definition works for the hypothetical bulk spacer DNA.

    3. Yes, active transposons are junk by that definition but it’s a good idea to distinguish between defective transposon sequences and ones that are still active.

      Spacer DNA is easy since it can’t be deleted without affecting fitness. Bulk DNA hypotheses pose more of a problem since you can easily delete parts of it but not all.

      There are other quibbles. Can you find them?

    4. @John,

      I’m guessing that you just like to find all the exceptions and you don’t have any better definitions, right?

    5. Your definition is fine. Active transposons are junk. Bulk DNA is problematic, if it even exists.

    6. There probably should be a separate category, neither junk nor functional, for DNA whose sequence doesn't matter, only its length, and that in a very fuzzy way.

    7. This comment has been removed by the author.

  6. This comment has been removed by the author.

  7. Is there a possibility that this junk DNA will not be "junk" in the future? Because it says "presumed."
