Sandwalk: What's In Your Genome?

Tuesday, March 27, 2018

What's In Your Genome? - The Pie Chart

Here's my latest compilation of the composition of the human genome. It's depicted in the form of a pie chart.¹ [UPDATED: March 29, 2018]

There are several ways of estimating the amount of functional DNA and the amount of junk DNA. All of them are approximations, but they differ by only a few percent. Note that several categories overlap. For example, introns and pseudogenes contain substantial amounts of DNA derived from transposons. The total amount of transposon-related sequence is about 60% when you include this fraction.

Here's the list of DNA sequences that are known or presumed to have a function (i.e. they are not junk).

functional parts of protein-coding genes (mostly coding regions): 1%
functional parts of genes for likely noncoding RNAs: 1%
regulatory sequences: 0.2%
scaffold attachment regions (SARs): 0.3%
origins of replication: 0.3%
centromeres: 1%
telomeres: 0.1%
functional virus sequences: 0.1%
functional transposons: 0.1%
conserved sequences of unknown function: ~3.9% (maximum)

This adds up to 8% of the genome. The remaining 92% is probably junk but the available evidence is consistent with another 2-5% being functional.
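As a quick check on the arithmetic, here is a minimal Python sketch that tallies the list above. The percentages are copied from the list; the dictionary keys are shortened labels and the variable names are only illustrative.

```python
# Percentages of the genome, copied from the list of functional categories.
functional = {
    "protein-coding genes (functional parts)": 1.0,
    "noncoding RNA genes (functional parts)": 1.0,
    "regulatory sequences": 0.2,
    "scaffold attachment regions": 0.3,
    "origins of replication": 0.3,
    "centromeres": 1.0,
    "telomeres": 0.1,
    "functional virus sequences": 0.1,
    "functional transposons": 0.1,
    "conserved, unknown function (maximum)": 3.9,
}

# Round to one decimal place to hide floating-point noise in the sum.
functional_total = round(sum(functional.values()), 1)
probable_junk = round(100 - functional_total, 1)

print(f"functional (upper estimate): {functional_total}%")
print(f"probably junk: {probable_junk}%")
```

Because the conserved-sequence entry is a maximum, the total is best read as an upper estimate for this list rather than a precise measurement.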

Most of the junk consists of: (1) very obvious examples of broken genes (pseudogenes 5%); (2) bits and pieces of transposon sequences that used to be capable of transposing but have mutated over time (45%); and (3) ancient viral sequences that have degenerated (9%). That's 59% of the genome that's clearly junk DNA.² In addition, there's plenty of evidence that most intron sequences are dispensable. That accounts for another 28% of the genome.³ The total amount of junk DNA is at least 87%.
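The junk tally can be sketched the same way. The numbers below come from the paragraph above and from the 92% figure given earlier. Reading the 5-point gap between the two as the part of the genome still open to debate is an assumption of this sketch, not something stated in the post.

```python
# Clearly junk categories, as percentages of the genome (from the text).
clear_junk = {
    "pseudogenes": 5,
    "degenerate transposon fragments": 45,
    "degenerate viral sequences": 9,
}
intron_junk = 28         # intron sequence treated as dispensable in the text
not_in_functional = 92   # genome outside the functional list, from the text

clear_total = sum(clear_junk.values())
minimum_junk = clear_total + intron_junk

# Assumption of this sketch: the difference is the still-undecided fraction.
undecided = not_in_functional - minimum_junk

print(f"clearly junk: {clear_total}%")
print(f"minimum junk: {minimum_junk}%")
print(f"undecided: {undecided}%")
```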

Note that protein-coding genes take up about 23% of the genome (1% exons, 22% introns). Genes for functional noncoding RNAs take up an additional 7% of the genome (1% exons, 6% introns). (Much of the functional region of noncoding RNA genes consists of 300 copies of ribosomal RNA genes (0.4%).) The important point is that roughly 30% of the genome is genes when we define a gene as a DNA sequence that's transcribed. A lot of this is junk within introns.

Also keep in mind that the well-characterized functional parts of the genome account for about 4% of the total but the functional regions of genes are only half of this total. Thus, we know that genes take up less than half of the total functional DNA in the human genome. This fact is not widely known even though the data is half-a-century old. I guess it takes some scientists a long time to learn the facts about the human genome.
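The gene figures in the last two paragraphs can be checked with a short Python sketch. It assumes that "well-characterized functional" means the 8% functional total minus the 3.9% of conserved sequence of unknown function; that reading is an inference from the list, and the variable names are hypothetical.

```python
# Gene footprint as percentages of the genome (from the text).
genes = {
    "protein-coding": {"exons": 1, "introns": 22},
    "noncoding RNA": {"exons": 1, "introns": 6},
}

gene_footprint = sum(sum(parts.values()) for parts in genes.values())
gene_functional = sum(parts["exons"] for parts in genes.values())
gene_introns = sum(parts["introns"] for parts in genes.values())

# Assumption: well-characterized functional DNA = functional total
# minus the conserved sequences of unknown function.
total_functional = 8.0
conserved_unknown = 3.9
well_characterized = round(total_functional - conserved_unknown, 1)

gene_share = gene_functional / well_characterized

print(f"genes (transcribed): {gene_footprint}% of the genome")
print(f"introns within genes: {gene_introns}% of the genome")
print(f"well-characterized functional DNA: {well_characterized}%")
print(f"share of that inside genes: {gene_share:.0%}")
```

Under that assumption the gene share comes out just under one half, which is consistent with the "less than half" claim.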

Required reading for the junk DNA debate
Five Things You Should Know if You Want to Participate in the Junk DNA Debate

1. I have to use a pie chart because they were invented by my wife's ancestor, William Playfair.

2. I'm not ruling out the idea that some of these broken genes and fragments of genes might secondarily have acquired a new function. There are some clear examples of this and they are included in the functional categories. However, the vast majority of this DNA must be just as it appears - junk DNA.

3. The evidence that most human intron sequences are junk is very compelling [Are introns mostly junk?].

37 comments:

Don Cates, Tuesday, March 27, 2018 4:55:00 PM
Pie charts should not be 3D. The 3rd D is uninformative and can be deceptive.
Otherwise, good info.

Peter Cathcart, Tuesday, March 27, 2018 7:49:00 PM
Non-expert here, retired GP. Love the pie. Reinforces what one reads in (Sandwalk-recommended) Kat Arney's book 'Herding Hemingway's Cats.' Sober demonstration that some 96% of a gene's sequence...promoter region aside...is tossed-out introns.

Approve of 3D pies, and the deeper the better. Await your book release.

John Harshman, Tuesday, March 27, 2018 10:32:00 PM
I will admit that I didn't know that RNA genes had introns. What percentage of them?

Joe Felsenstein, Wednesday, March 28, 2018 7:42:00 AM
Two-dimensional pies would not be worth eating.

Donald Forsdyke, Wednesday, March 28, 2018 8:54:00 AM
Once upon a time they thought that intron locations separated protein domains (as they do in a minority of proteins). When they were found in non-coding parts of RNAs that hypothesis lost favour.

Anonymous, Wednesday, March 28, 2018 9:48:00 PM
I recommend making the wedge for introns in protein-coding genes a little different shade, probably paler and/or greener. The pie chart is good. I don't care one way or the other about the 3-D component.

Snake, Thursday, March 29, 2018 4:53:00 AM

Hi Larry,

This is a great figure. Is it OK if I use it in my undergrad lectures? Appropriate credit will be given, of course.

Simon

Unknown, Thursday, March 29, 2018 11:57:00 AM
How do you count defective transposons in transcribed pre-mRNA (introns, UTRs especially)? Current human genome assembly is annotated as 40% transcribed to coding pre-mRNAs, not 20%. About 50% of that sequence is detectably derived from mobile elements. Does your "intron" category mean "intron sequence that isn't already counted in another category"?

(That is, more generally: your categories aren't exclusive, so they shouldn't add up to 100%.)

Unknown, Thursday, March 29, 2018 1:47:00 PM
I think it's important to show in this fig that ~40% of the genome is transcribed to coding pre-mRNAs. One of the main (misleading) arguments about pervasive transcription goes like "only 1% of the genome is coding, but most of the genome is transcribed". Many people are surprised to learn how much of the genome is covered by annotated coding pre-mRNA transcription units.

Pausanias, Friday, March 30, 2018 10:55:00 AM
The exons and Numts labels look switched around.

Pausanias, Friday, March 30, 2018 10:57:00 AM
Nvm, that line between exons and introns looks like the pie slice line and Numts is so small it's probably invisible between exons and unknown.

Gabo, Thursday, April 05, 2018 4:58:00 PM
Where can we get those numbers/proportions for referencing?

Unknown, Friday, June 22, 2018 12:48:00 AM
Brilliant summary, thank you.

Nesslig20, Sunday, March 31, 2019 4:49:00 PM
Hello, good summary. Where can I get the citations for the numbers on this?

Luís TA, Monday, April 22, 2019 4:17:00 PM
It seems a great article, but in order to be taken seriously it should include some citations :)

Henry Norman, Saturday, June 08, 2019 10:37:00 PM
First, the pie chart tells me that ≈ 9% is “unknown” (is this “junk DNA”?)

Then, a “list of DNA sequences that are known or presumed to have a function (i.e. they are not junk)” is presented ... and concluded with “This adds up to 8% of the genome. The remaining 92% is junk.”

How should I read this?

How does the “list of known DNA sequences” (and the percentages shown) relate to the pie chart?

Is it “92% junk” or “9% junk”?

Anonymous, Thursday, April 06, 2023 3:53:00 AM
When the genome was decoded, Lander referred to it as a ‘parts list’. He commented upon the need for the ‘operating manual’.

This is surely the most succinct comment made by a genomics researcher. Directly or indirectly, the genes collectively express proteins but also other biological moieties. I use the plural ‘genes’ because there are very few cases where a single gene expresses a single protein, which explains why genetic engineering is rarely 100% effective. It is not that the genomic screening is wrong (although I am astonished that such a complex technique is considered to be so); it is that elements of chemistry, i.e. the shape/conformity, energetics and reactivity, are being overlooked. There is more to the genome than just its chemical structure.

Everything has to conform to the laws of chemistry and physics.