Sandwalk: What's In Your Genome?

Tuesday, March 27, 2018

What's In Your Genome? - The Pie Chart

Here's my latest compilation of the composition of the human genome. It's depicted in the form of a pie chart.¹ [UPDATED: March 29, 2018]

There are several ways of estimating the amount of functional DNA and the amount of junk DNA. All of them are approximations but they only differ by a few percent. Note that several categories overlap. For example, introns and pseudogenes contain substantial amounts of DNA derived from transposons. The total amount of transposon-related sequence is about 60% when you include this fraction.

Here's the list of DNA sequences that are known or presumed to have a function (i.e. they are not junk).

functional parts of protein-coding genes (mostly coding regions): 1%
functional parts of genes for likely noncoding RNAs: 1%
regulatory sequences: 0.2%
scaffold attachment regions (SARs): 0.3%
origins of replication: 0.3%
centromeres: 1%
telomeres: 0.1%
functional virus sequences: 0.1%
functional transposons: 0.1%
conserved sequences of unknown function: ~3.9% (maximum)

This adds up to 8% of the genome. The remaining 92% is probably junk but the available evidence is consistent with another 2-5% being functional.
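
As a quick check on the arithmetic, here's a minimal Python sketch that sums the categories in the list above. The dictionary keys are shortened labels made up for this sketch; the values are the percentages from the list.

```python
# Percent of the genome assigned to each functional category in the list above.
# The keys are shortened labels chosen for this sketch only.
functional = {
    "protein-coding genes (functional parts)": 1.0,
    "noncoding RNA genes (functional parts)": 1.0,
    "regulatory sequences": 0.2,
    "scaffold attachment regions": 0.3,
    "origins of replication": 0.3,
    "centromeres": 1.0,
    "telomeres": 0.1,
    "functional virus sequences": 0.1,
    "functional transposons": 0.1,
    "conserved, unknown function (maximum)": 3.9,
}

# Round to one decimal place to hide floating-point noise.
total_functional = round(sum(functional.values()), 1)
remaining = round(100 - total_functional, 1)

print(f"functional: {total_functional}%")
print(f"remaining:  {remaining}%")
```

Keeping the categories in one table like this makes it easy to see how much the total depends on the last entry, which is a maximum and not a measured value.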

Most of the junk consists of: (1) very obvious examples of broken genes (pseudogenes 5%); (2) bits and pieces of transposon sequences that used to be capable of transposing but have mutated over time (45%); and (3) ancient viral sequences that have degenerated (9%). That's 59% of the genome that's clearly junk DNA.² In addition, there's plenty of evidence that most intron sequences are dispensable. That accounts for another 28% of the genome.³ The total amount of junk DNA is at least 87%.
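
The same sort of bookkeeping applies to the junk categories. This is only a sketch using the percentages quoted in the paragraph above; the variable names are mine.

```python
# Percentages quoted in the paragraph above.
pseudogenes = 5
defective_transposons = 45
defective_viruses = 9
introns = 28  # intron sequence, most of which appears to be dispensable

# Sequences that are clearly junk: broken genes, transposons, and viruses.
clearly_junk = pseudogenes + defective_transposons + defective_viruses

# Adding the intron sequence gives the lower bound on total junk.
junk_minimum = clearly_junk + introns

print(f"clearly junk: {clearly_junk}%")
print(f"at least:     {junk_minimum}%")
```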

Note that protein-coding genes take up about 23% of the genome (1% exons, 22% introns). Genes for functional noncoding RNAs take up an additional 7% of the genome (1% exons, 6% introns). (Much of the functional region of noncoding RNA genes consists of 300 copies of ribosomal RNA genes (0.4%).) The important point is that roughly 30% of the genome is genes when we define a gene as a DNA sequence that's transcribed. A lot of this is junk within introns.
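
Here's a small Python sketch of how those gene numbers combine, and how much of the gene total is intron. The dictionary layout and names are just for illustration; the values come from the paragraph above.

```python
# Percent of the genome covered by each class of gene, split into
# exons and introns, as quoted in the paragraph above.
protein_coding = {"exons": 1, "introns": 22}
noncoding_rna = {"exons": 1, "introns": 6}

protein_coding_total = sum(protein_coding.values())
noncoding_rna_total = sum(noncoding_rna.values())
genes_total = protein_coding_total + noncoding_rna_total

# Fraction of the gene total that is intron sequence.
introns_total = protein_coding["introns"] + noncoding_rna["introns"]
intron_share = introns_total / genes_total

print(f"genes cover {genes_total}% of the genome")
print(f"introns make up {intron_share:.0%} of that")
```

The intron total here is the same 28% that appears in the junk estimate above.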

Also keep in mind that the well-characterized functional parts of the genome account for about 4% of the total but the functional regions of genes are only half of this total. Thus, we know that genes take up less than half of the total functional DNA in the human genome. This fact is not widely known even though the data is half-a-century old. I guess it takes some scientists a long time to learn the facts about the human genome.

Required reading for the junk DNA debate
Five Things You Should Know if You Want to Participate in the Junk DNA Debate

1. I have to use a pie chart because they were invented by my wife's ancestor, William Playfair.

2. I'm not ruling out the idea that some of these broken genes and fragments of genes might secondarily have acquired a new function. There are some clear examples of this and they are included in the functional categories. However, the vast majority of this DNA must be just as it appears - junk DNA.

3. The evidence for most of human intron sequences being junk is very compelling [Are introns mostly junk?].

37 comments :

Don Cates said...: Pie charts should not be 3D. The 3rd D is uninformative and can be deceptive.
Otherwise, good info.; Tuesday, March 27, 2018 4:55:00 PM
Larry Moran said...: Are you familiar with the work "picky"? :-)

I'm trying to make an attractive figure in order to get a point across to non-experts. The exact percentages and the area of the wedges aren't important. This is for a book chapter called "The Big Picture."

P.S. Real pies are three-dimensional. :-); Tuesday, March 27, 2018 5:17:00 PM
Peter Cathcart said...: Non-expert here, retired GP. Love the pie. Reinforces what one reads in (Sandwalk-recommended) Kat Arney's book 'Herding Hemingway's Cats.' Sober demonstration that some 96% of a gene's sequence...promoter region aside...is tossed-out introns.

Approve of 3D pies, and the deeper the better. Await your book release.; Tuesday, March 27, 2018 7:49:00 PM
John Harshman said...: I will admit that I didn't know that RNA genes had introns. What percentage of them?; Tuesday, March 27, 2018 10:32:00 PM
Larry Moran said...: Several tRNA genes have introns. Some ribosomal RNA genes have introns. Many of the genes for small RNAs are extensively processed, including intron splicing. Most lncRNA genes have large introns.; Tuesday, March 27, 2018 10:39:00 PM
Don Cates said...: I was very picky in my "work"(B-). Made a lot of graphs for scientific papers. Most 3D graphs where the 3rd D is meaningless are deceptive (mostly non-intentionally) in one way or another. (One of) my other buttons is the use of colour as the only or primary way of comparing data sets. With some colour choices (often the default choices), 5-15% of your audience don't get the message.
I blame PowerPoint (fuck Microsoft).

and... your pie isn't a real one.(B-); Wednesday, March 28, 2018 12:38:00 AM
Don Cates said...: Ha... I bet you also like 3D bar charts with the sloping baseline, multi-coloured grouped bars, and a ribbon (3D line!) hovering over it all. pfthtt!

Oh, I'm pickin'. (B-)
(Oh, I've had cataract surgery.. make that (;-); Wednesday, March 28, 2018 1:35:00 AM
Joe Felsenstein said...: Two-dimensional pies would not be worth eating.; Wednesday, March 28, 2018 7:42:00 AM
Donald Forsdyke said...: Once upon a time they thought that intron locations separated protein domains (as they do in a minority of proteins). When they were found in non-coding parts of RNAs that hypothesis lost favour.; Wednesday, March 28, 2018 8:54:00 AM
Unknown said...: The minimum number of dimensions for a pie to be worth eating is 5 (see Kaluza (1921) and Klein (1926) for details).

Kaluza, Theodor (1921). "Zum Unitätsproblem in der Physik". Sitzungsber. Preuss. Akad. Wiss. Berlin. (Math. Phys.): 966–972.
Klein, Oskar (1926). "Quantentheorie und fünfdimensionale Relativitätstheorie". Zeitschrift für Physik A. 37 (12): 895–906.; Wednesday, March 28, 2018 10:43:00 AM
John Harshman said...: Wouldn't the tendency for introns to begin/end out of frame be a good clue?; Wednesday, March 28, 2018 4:16:00 PM
Donald Forsdyke said...: You would think so. But the protein domain hypothesis persisted for some time, with its proponents turning cartwheels to sustain it.; Wednesday, March 28, 2018 6:33:00 PM
Anonymous said...: I recommend making the wedge for introns in protein-coding genes a little different shade, probably paler and/or greener. The pie chart is good. I don't care one way or the other about the 3-D component.; Wednesday, March 28, 2018 9:48:00 PM
Snake said...: Hi Larry,

This is a great figure. Is it OK if I use it in my undergrad lectures? Appropriate credit will be given, of course.

Simon; Thursday, March 29, 2018 4:53:00 AM
Unknown said...: How do you count defective transposons in transcribed pre-mRNA (introns, UTRs especially)? Current human genome assembly is annotated as 40% transcribed to coding pre-mRNAs, not 20%. About 50% of that sequence is detectably derived from mobile elements. Does your "intron" category mean "intron sequence that isn't already counted in another category"?

(That is, more generally: your categories aren't exclusive, so they shouldn't add up to 100%.); Thursday, March 29, 2018 11:57:00 AM
Larry Moran said...: You are correct. Several of the categories overlap making it really difficult to present the data in a meaningful way. See What's in Your Genome?

I fudged the numbers by ignoring stuff in introns that's included in other categories. More than half of the intron sequences contain defective transposons and defective viruses. Some intron sequences include noncoding genes.

The total amount of DNA occupied by transposon fragments is closer to 65% when you allow for sequences that are more than 50 million years old so this compensates for the decision to ignore transposons in introns.

About 10% of the genome doesn't fit into any category - it's intergenic unique sequence junk DNA. I just realized that I forgot to include that category so the numbers don't add up to 100%. I have to redo the pie chart.; Thursday, March 29, 2018 12:37:00 PM
Larry Moran said...: Some genome annotations include the most distant 5′ start sites even if the RNA starting from those sites is extremely rare. Same for termination sites. These are probably not biologically relevant alternative transcripts. They should be ignored.

As a consequence, the size of most genes is inflated and so is the number of introns. I ignore the ridiculous false upstream promoters.; Thursday, March 29, 2018 12:50:00 PM
Unknown said...: I think it's important to show in this fig that ~40% of the genome is transcribed to coding pre-mRNAs. One of the main (misleading) arguments about pervasive transcription goes like "only 1% of the genome is coding, but most of the genome is transcribed". Many people are surprised to learn how much of the genome is covered by annotated coding pre-mRNA transcription units.; Thursday, March 29, 2018 1:47:00 PM
Unknown said...: What's your source for the 65% detectably mobile element derived number? I don't think that's right. It's more like 50-55% in most people's hands, except for one paper that I think is likely an outlier with a false positive issue.

(Mind you, I think the fraction of the genome that's derived from mobile elements is >90%, because they decay away so fast and can't be recognized; just talking "detectable" by similarity to known mobile element families.); Thursday, March 29, 2018 1:55:00 PM
Larry Moran said...: @Sean Eddy

I was impressed with the work of Platt et al. (2016). They make a good case that the percentage of transposon-related sequences is consistently underestimated in mammalian genomes.

de Koning et al. (2011) used a new algorithm that works better with short segments of repeat DNA and they estimate that 66-69% of the genome is derived from transposons. They explain why older techniques underestimate repeats. Their explanation seems credible to me.

I think it's reasonable to assume that 60% of the human genome is derived from transposons. It's almost certainly more than 50% and the rest depends on the look-back time (sequence similarity).

Dan Graur agrees with you that almost all junk DNA is derived from transposons (~90%). That can't be right since we know that a substantial fraction comes from integrated virus DNA (~9%). We also know that segmental duplications account for a significant fraction of excess DNA and some of that is unique-sequence DNA (e.g. coding regions and the functional part of genes for noncoding RNAs).

We'll never know for sure what fraction of junk DNA came from transposons but I don't think it's wise to claim that it's 100%.

de Koning, A., Gu, W., Castoe, T.A., Batzer, M. A., and Pollock, D.D. (2011) Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet, 7:e1002384. [doi: 10.1371/journal.pgen.1002384]

Platt, R.N., Blanco-Berdugo, L., and Ray, D.A. (2016) Accurate transposable element annotation is vital when analyzing new genome assemblies. Genome Biology and Evolution, 8:403-410. [doi: 10.1093/gbe/evw009]; Thursday, March 29, 2018 4:56:00 PM
Larry Moran said...: I agree that it's important to say that genes make up 30% of the genome. I'm making a big deal of this in my book, especially in the chapter on pervasive transcription.

I agree with you that the 1% figure is extremely misleading (i.e. fake news). It's 2018. Scientists and science writers should not be making such mistakes.

We can quibble about the exact percentage due to genes. I think the annotators are making a mistake by including extra DNA at the ends of the real genes. When you look at well-characterized genes you will often find that annotators have tacked on an extra few kilobases that represent spurious initiation. That's why some estimates suggest that genes cover 40% of the genome. I think this is junk RNA and real protein-coding genes represent only 23% of the genome.; Thursday, March 29, 2018 5:15:00 PM
Unknown said...: Mobile element derived = transposon derived + virus derived, so we don't disagree there.

You might have a second look at the de Koning et al. number. It's an outlier in the literature. Folks in my lab tested it against negative controls, and I believe it's an overestimate of what they can detect reliably. They're certainly correct that available methods fail to detect highly diverged mobile elements though.; Thursday, March 29, 2018 6:33:00 PM
Unknown said...: Yes, I agree it's reasonable to worry about overannotated low-usage 5' and 3' ends. I don't know of a better objective number to use though - where did you get 23% from? And are you applying this no-crappy-annotation standard equally to lncRNA annotation? I'm surprised that you say there's 6% (180 Mb) in ncRNA introns; I would guess you'd have to rely on current genome-wide lncRNA annotation to get a number that high. Well-supported ncRNA gene transcripts definitely have some introns, but I wouldn't have imagined 180 Mb.; Thursday, March 29, 2018 6:40:00 PM
Larry Moran said...: Wherever possible, I try to rely on data from well-characterized genes and not on genome predictions. Here's a draft of what I've written so far ....

"Both protein-coding genes and noncoding genes can have introns. A typical protein-coding gene in humans has 8 or 9 exons and 7 or 8 introns. The number ranges from zero to 30 but the vast majority of protein-coding genes have fewer than 10 exons. In contrast, those genes that produce noncoding RNAs usually don’t have introns and those that do have only one intron (two exons) (Harrow et al., 2012).

The average number of introns in a human protein-coding gene is 7.7 and the average length of introns is 4.66 kb (Lynch, 2007 p. 49). The average exon is 0.15 kb (150 bp, enough to code for 50 amino acids). There are 8.7 exons so the average coding region is about 1300 bp if these numbers are accurate. That would encode a protein of about 435 amino acid residues with a molecular weight of about 54,000. That’s about right for an average protein.

The average contribution of introns in a gene is 7.7 × 4.66 kb = 35.9 kb or 35,900 base pairs. If you add together the exons and introns, you get 37,200 base pairs. We’ll assume that the average protein-coding gene (transcribed region) is 37,200 bp or 37.2 kb.

There are roughly 20,000 of these protein-coding genes. They would occupy 37.2 × 20,000 = 744,000 kb or 23.3% of the genome. You may have heard that genes make up only 1% or 2% of our genome but that only counts exon sequences. The total amount of coding region (exons) is 20,000 × 1300 = 26,000,000 bp (26 Mb) or 0.8% of the genome."
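
The arithmetic in this draft can be reproduced with a short Python sketch. The variable names are mine, and the 3.2 Gb genome size is an assumption that the draft doesn't state explicitly. Because the sketch carries unrounded intermediate values, its results agree with the draft's figures to within rounding (about 37.2 kb per gene, about 23% of the genome in protein-coding genes, and about 0.8% in coding exons).

```python
# Averages quoted in the draft (Lynch, 2007, for intron number and length).
INTRONS_PER_GENE = 7.7
INTRON_LENGTH_KB = 4.66
EXONS_PER_GENE = 8.7
EXON_LENGTH_KB = 0.15
N_PROTEIN_CODING_GENES = 20_000

# Assumption for this sketch: a haploid genome of 3.2 Gb, expressed in kb.
GENOME_KB = 3_200_000

intron_kb_per_gene = INTRONS_PER_GENE * INTRON_LENGTH_KB
exon_kb_per_gene = EXONS_PER_GENE * EXON_LENGTH_KB
gene_kb = intron_kb_per_gene + exon_kb_per_gene

# Three bases per codon gives the length of the encoded protein.
residues = round(exon_kb_per_gene * 1000 / 3)

genes_percent = gene_kb * N_PROTEIN_CODING_GENES / GENOME_KB * 100
coding_percent = exon_kb_per_gene * N_PROTEIN_CODING_GENES / GENOME_KB * 100

print(f"introns per gene: {intron_kb_per_gene:.1f} kb")
print(f"exons per gene:   {exon_kb_per_gene:.2f} kb ({residues} residues)")
print(f"average gene:     {gene_kb:.1f} kb")
print(f"genes:            {genes_percent:.1f}% of the genome")
print(f"coding exons:     {coding_percent:.1f}% of the genome")
```

Changing `GENOME_KB` or the intron averages shows how sensitive the gene fraction is to those inputs; the exon fraction barely moves.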

The estimate for noncoding RNA genes is more complicated. We know about the well-characterized genes but we have to allow for the existence of a number of other genes (e.g. genes for lncRNAs). I want to be fairly generous in my estimate but I also want to challenge the exaggerated claims.

Here's what I've got so far ....

"The genes for tRNAs account for less than 0.1% of the human genome. The genes for all the other small RNAs make up less than 0.1% of the genome. There are about 300 copies of each of the ribosomal RNA genes scattered over several chromosomes in five clusters of about 60 genes each (Stults et al., 2008). This accounts for about 0.4% of the genome. The total for all of these well-characterized non-protein-coding genes is no more than 0.6% of your genome.

The main controversy over the number of genes is over how to count those parts of the genome that are transcribed to produce RNA (potential genes) but where there’s no known function for those RNAs. The latest estimate from the Ensembl website (July 2015) lists an additional 20,000 such “genes.” Most of them are bits of DNA complementary to a special type of noncoding RNA called “long noncoding RNA” (lncRNA). Note that the Ensembl annotators are using a different definition of a gene than the one I’m using. They don’t really care if the RNA product has a function or not so they describe any piece of DNA that’s transcribed as a “gene.” That’s not going to work because the correct definition of a gene requires that it produce a functional product. Otherwise it’s not a gene, although it may be a potential gene.

For now, let’s assume there are about 5,000 noncoding RNA genes in total. Many of them have large introns. These additional genes may cover about 6.4% of the genome if they contain lots of large introns. (This is a generous estimate.) Adding up noncoding and coding genes accounts for roughly 30% of the genome. The functional parts of these extra noncoding genes might only cover about 0.4% of the genome."; Thursday, March 29, 2018 6:57:00 PM
Pausanias said...: The exons and Numts labels look switched around.; Friday, March 30, 2018 10:55:00 AM
Pausanias said...: Nvm, that line between exons and introns looks like the pie slice line and Numts is so small it's probably invisible between exons and unknown.; Friday, March 30, 2018 10:57:00 AM
The Lorax said...: "low-usage 5' and 3' ends" what does this mean? Ok, I know what it means, but this context seems to exclude the population idea. In many populations, i.e. cancer cell lines, these may be low usage, but is that true in living humans? all cell types? genetic make ups?

I realize I'm referring to a small percent of a small percent, but is annotation great at a population level?; Friday, March 30, 2018 7:55:00 PM
Unknown said...: Looks like a cheesecake to me, the type that have a variety of flavors. Some flavors taste good (Numts, for example) and others not so well (defective transposons).; Sunday, April 01, 2018 4:50:00 PM
Gabo said...: Where can we get those numbers/proportions for referencing?; Thursday, April 05, 2018 4:58:00 PM
Unknown said...: Brilliant summary, thank you.; Friday, June 22, 2018 12:48:00 AM
Nesslig20 said...: Hello, good summary. Where can I get the citations for the numbers on this?; Sunday, March 31, 2019 4:49:00 PM
Luís TA said...: It seems a great article, but in order to be taken seriously it should include some citations :); Monday, April 22, 2019 4:17:00 PM
Larry Moran said...: I read dozens and dozens of papers on genome composition in order to come up with the values in the pie chart. There are no specific citations that I can give you to back up each value. If you have a question about any one of those values I'd be happy to explain why I think it's accurate and give you multiple, and often conflicting, references to the scientific literature.; Wednesday, April 24, 2019 10:37:00 AM
Larry Moran said...: Or you can wait until my book is published and check all the references that I include. :-); Wednesday, April 24, 2019 10:38:00 AM
Larry Moran said...: See above. You can take it seriously because I've been studying this problem for thirty years. That doesn't mean I'm 100% correct about every value in the pie chart but I'm confident that they are as accurate as we could get in 2018.

I've made some minor revisions and updates that I'll post later.; Wednesday, April 24, 2019 10:46:00 AM
Henry Norman said...: First, the pie chart tells me that ≈ 9% is “unknown” (is this “junk DNA”?)

Then, a “list of DNA sequences that are known or presumed to have a function (i.e. they are not junk)” is presented ... and concluded with “This adds up to 8% of the genome. The remaining 92% is junk.”

How should I read this?

How does the “list of known DNA sequences” (and the percentages shown) relate to the pie chart?

Is it “92% junk” or “9% junk”?; Saturday, June 08, 2019 10:37:00 PM
Anonymous said...: Lander has referred when decoding the genome to it being a ‘parts list’. He commented upon the need for the ‘operating manual’.

This is surely the most succinct comment made by a genomics researcher. Directly, or indirectly, the genes collectively express proteins but also other biological moieties. I use the plural ‘genes’ because there are very few cases where a single gene expresses a single protein which explains why genetic engineering is rarely 100% effective. It is not that the genomic screening is wrong (although I am astonished that such a complex technique is considered to be so); it is that elements of chemistry are being overlooked, i.e. the shape/conformity, energetics and reactivity. There is more to the genome than just its chemical structure.

Everything has to conform to the laws of chemistry and physics.; Thursday, April 06, 2023 3:53:00 AM

