Tuesday, March 27, 2018

What's In Your Genome? - The Pie Chart

Here's my latest compilation of the composition of the human genome. It's depicted in the form of a pie chart.1 [UPDATED: March 29, 2018]

There are several ways of estimating the amount of functional DNA and the amount of junk DNA. All of them are approximations but they only differ by a few percent. Note that several categories overlap. For example, introns and pseudogenes contain substantial amounts of DNA derived from transposons. The total amount of transposon-related sequence is about 60% when you include this fraction.

Here's the list of DNA sequences that are known or presumed to have a function (i.e. they are not junk).
  • functional parts of protein-coding genes (mostly coding regions): 1%
  • functional parts of genes for likely noncoding RNAs: 1%
  • regulatory sequences: 0.2%
  • scaffold attachment regions (SARs): 0.3%
  • origins of replication: 0.3%
  • centromeres: 1%
  • telomeres: 0.1%
  • functional virus sequences: 0.1%
  • functional transposons: 0.1%
  • conserved sequences of unknown function: ~3.9% (maximum)
This adds up to 8% of the genome. The remaining 92% is probably junk but the available evidence is consistent with another 2-5% being functional.
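
The 8% total is easy to check. Here's a minimal sketch, assuming the rounded percentages in the list above. The dictionary keys are shorthand labels invented for this example and don't come from any database.

```python
# Percent of the genome assigned to each functional category in the list above.
# Key names are illustrative shorthand, not official terms.
functional_percent = {
    "protein_coding_functional": 1.0,
    "noncoding_rna_functional": 1.0,
    "regulatory": 0.2,
    "scaffold_attachment_regions": 0.3,
    "origins_of_replication": 0.3,
    "centromeres": 1.0,
    "telomeres": 0.1,
    "functional_virus": 0.1,
    "functional_transposons": 0.1,
    "conserved_unknown_function_max": 3.9,
}

# Round to one decimal place to hide floating-point noise in the sum.
total_functional = round(sum(functional_percent.values()), 1)
remaining = round(100.0 - total_functional, 1)

print(f"functional: {total_functional}%")
print(f"remaining:  {remaining}%")
```

Because every input is already rounded, the total is only as good as the individual estimates.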

Most of the junk consists of: (1) very obvious examples of broken genes (pseudogenes 5%); (2) bits and pieces of transposon sequences that used to be capable of transposing but have mutated over time (45%); and (3) ancient viral sequences that have degenerated (9%). That's 59% of the genome that's clearly junk DNA.2 In addition, there's plenty of evidence that most intron sequences are dispensable. That accounts for another 28% of the genome.3 The total amount of junk DNA is at least 87%.
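
The junk tally can be sketched the same way. This assumes the categories don't overlap, which is a simplification since some transposon fragments lie inside introns. The 28% intron figure is taken as the sum of the two intron values given in the next paragraph (22% and 6%); the variable names are made up for this example.

```python
# Clearly identifiable junk, in percent of the genome (values from the text).
pseudogenes = 5
defective_transposons = 45
degenerate_viral = 9
obvious_junk = pseudogenes + defective_transposons + degenerate_viral

# Intron sequence: protein-coding genes plus noncoding RNA genes.
introns = 22 + 6

# Lower bound on junk, treating the categories as non-overlapping.
minimum_junk = obvious_junk + introns

print(f"obvious junk: {obvious_junk}%")
print(f"introns:      {introns}%")
print(f"minimum junk: {minimum_junk}%")
```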

Note that protein-coding genes take up about 23% of the genome (1% exons, 22% introns). Genes for functional noncoding RNAs take up an additional 7% of the genome (1% exons, 6% introns). (Much of the functional region of noncoding RNA genes consists of 300 copies of ribosomal RNA genes (0.4%).) The important point is that roughly 30% of the genome is genes when we define a gene as a DNA sequence that's transcribed. A lot of this is junk within introns.
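
A figure of about 23% can be roughly reproduced from per-gene averages. The sketch below uses the averages quoted in the comments under this post (7.7 introns of 4.66 kb, 8.7 exons of 0.15 kb, about 20,000 genes). The genome size of 3.2 billion bp is an assumption; it's not stated anywhere in the post, and a different value shifts the percentages slightly.

```python
GENOME_SIZE_BP = 3.2e9  # ASSUMPTION: haploid genome size, not given in the post

introns_per_gene = 7.7
mean_intron_bp = 4660
exons_per_gene = introns_per_gene + 1  # a gene has one more exon than introns
mean_exon_bp = 150
n_genes = 20_000

exon_bp = exons_per_gene * mean_exon_bp        # roughly 1,300 bp per gene
intron_bp = introns_per_gene * mean_intron_bp  # roughly 35,900 bp per gene
gene_bp = exon_bp + intron_bp                  # roughly 37,200 bp per gene

gene_fraction = n_genes * gene_bp / GENOME_SIZE_BP
exon_fraction = n_genes * exon_bp / GENOME_SIZE_BP

print(f"average gene length: {gene_bp:,.0f} bp")
print(f"genes: {gene_fraction:.1%} of genome, exons: {exon_fraction:.1%}")
```

Almost all of the gene footprint comes from introns, which is why the exon-only figure of about 1% is so much smaller than the figure for whole genes.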

Also keep in mind that the well-characterized functional parts of the genome account for about 4% of the total but the functional regions of genes are only half of this total. Thus, we know that genes take up less than half of the total functional DNA in the human genome. This fact is not widely known even though the data is half-a-century old. I guess it takes some scientists a long time to learn the facts about the human genome.

Required reading for the junk DNA debate
Five Things You Should Know if You Want to Participate in the Junk DNA Debate


1. I have to use a pie chart because they were invented by my wife's ancestor, William Playfair.

2. I'm not ruling out the idea that some of these broken genes and fragments of genes might secondarily have acquired a new function. There are some clear examples of this and they are included in the functional categories. However, the vast majority of this DNA must be just as it appears - junk DNA.

3. The evidence that most human intron sequences are junk is very compelling [Are introns mostly junk?].

37 comments:

  1. Pie charts should not be 3D. The 3rd D is uninformative and can be deceptive.
    Otherwise, good info.

    1. Are you familiar with the work "picky"? :-)

      I'm trying to make an attractive figure in order to get a point across to non-experts. The exact percentages and the area of the wedges aren't important. This is for a book chapter called "The Big Picture."

      P.S. Real pies are three-dimensional. :-)

    2. I was very picky in my "work"(B-). Made a lot of graphs for scientific papers. Most 3D graphs where the 3rd D is meaningless are deceptive (mostly non-intentionally) in one way or another. (One of) my other buttons is the use of colour as the only or primary way of comparing data sets. With some colour choices (often the default choices), 5-15% of your audience don't get the message.
      I blame PowerPoint (fuck Microsoft).

      and... your pie isn't a real one.(B-)

    3. Looks like a cheesecake to me, the type that have a variety of flavors. Some flavors taste good (Numts, for example) and others not so well (defective transposons).

  2. Non-expert here, retired GP. Love the pie. Reinforces what one reads in (Sandwalk-recommended) Kat Arney's book 'Herding Hemingway's Cats.' Sober demonstration that some 96% of a gene's sequence...promoter region aside...is tossed-out introns.

    Approve of 3D pies, and the deeper the better. Await your book release.

    1. Ha... I bet you also like 3D bar charts with the sloping baseline, multi-coloured grouped bars, and a ribbon (3D line!) hovering over it all. pfthtt!

      Oh, I'm pickin'. (B-)
      (Oh, I've had cataract surgery.. make that (;-)

  3. I will admit that I didn't know that RNA genes had introns. What percentage of them?

    1. Several tRNA genes have introns. Some ribosomal RNA genes have introns. Many of the genes for small RNAs are extensively processed, including intron splicing. Most lncRNA genes have large introns.

  4. Two-dimensional pies would not be worth eating.

    1. The minimum number of dimensions for a pie to be worth eating is 5 (see Kaluza (1921) and Klein (1926) for details).

      Kaluza, Theodor (1921). "Zum Unitätsproblem in der Physik". Sitzungsber. Preuss. Akad. Wiss. Berlin. (Math. Phys.): 966–972.
      Klein, Oskar (1926). "Quantentheorie und fünfdimensionale Relativitätstheorie". Zeitschrift für Physik A. 37 (12): 895–906.

  5. Once upon a time they thought that intron locations separated protein domains (as they do in a minority of proteins). When they were found in non-coding parts of RNAs that hypothesis lost favour.

    1. Wouldn't the tendency for introns to begin/end out of frame be a good clue?

    2. You would think so. But the protein domain hypothesis persisted for some time, with its proponents turning cartwheels to sustain it.

  6. I recommend making the wedge for introns in protein-coding genes a slightly different shade, probably paler and/or greener. The pie chart is good. I don't care one way or the other about the 3-D component.


  7. Hi Larry,

    This is a great figure. Is it OK if I use it in my undergrad lectures? Appropriate credit will be given, of course.

    Simon

  8. How do you count defective transposons in transcribed pre-mRNA (introns, UTRs especially)? Current human genome assembly is annotated as 40% transcribed to coding pre-mRNAs, not 20%. About 50% of that sequence is detectably derived from mobile elements. Does your "intron" category mean "intron sequence that isn't already counted in another category"?

    (That is, more generally: your categories aren't exclusive, so they shouldn't add up to 100%.)

    1. You are correct. Several of the categories overlap making it really difficult to present the data in a meaningful way. See What's in Your Genome?

      I fudged the numbers by ignoring stuff in introns that's included in other categories. More than half of the intron sequences contain defective transposons and defective viruses. Some intron sequences include noncoding genes.

      The total amount of DNA occupied by transposon fragments is closer to 65% when you allow for sequences that are more than 50 million years old so this compensates for the decision to ignore transposons in introns.

      About 10% of the genome doesn't fit into any category - it's intergenic unique sequence junk DNA. I just realized that I forgot to include that category so the numbers don't add up to 100%. I have to redo the pie chart.

    2. Some genome annotations include the most distant 5′ start sites even if the RNA starting from those sites is extremely rare. Same for termination sites. These are probably not biologically relevant alternative transcripts. They should be ignored.

      As a consequence, the size of most genes is inflated and so is the number of introns. I ignore the ridiculous false upstream promoters.

    3. What's your source for the 65% detectably mobile element derived number? I don't think that's right. It's more like 50-55% in most people's hands, except for one paper that I think is likely an outlier with a false positive issue.

      (Mind you, I think the fraction of the genome that's derived from mobile elements is >90%, because they decay away so fast and can't be recognized; just talking "detectable" by similarity to known mobile element families.)

    4. @Sean Eddy

      I was impressed with the work of Platt et al. (2016). They make a good case that the percentage of transposon-related sequences is consistently underestimated in mammalian genomes.

      de Koning et al. (2011) used a new algorithm that works better with short segments of repeat DNA and they estimate that 66-69% of the genome is derived from transposons. They explain why older techniques underestimate repeats. Their explanation seems credible to me.

      I think it's reasonable to assume that 60% of the human genome is derived from transposons. It's almost certainly more than 50% and the rest depends on the look-back time (sequence similarity).

      Dan Graur agrees with you that almost all junk DNA is derived from transposons (~90%). That can't be right since we know that a substantial fraction comes from integrated virus DNA (~9%). We also know that segmental duplications account for a significant fraction of excess DNA and some of that is unique-sequence DNA (e.g. coding regions and the functional part of genes for noncoding RNAs).

      We'll never know for sure what fraction of junk DNA came from transposons but I don't think it's wise to claim that it's 100%.

      de Koning, A., Gu, W., Castoe, T.A., Batzer, M. A., and Pollock, D.D. (2011) Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet, 7:e1002384. [doi: 10.1371/journal.pgen.1002384]

      Platt, R.N., Blanco-Berdugo, L., and Ray, D.A. (2016) Accurate transposable element annotation is vital when analyzing new genome assemblies. Genome Biology and Evolution, 8:403-410. [doi: 10.1093/gbe/evw009]

    5. Mobile element derived = transposon derived + virus derived, so we don't disagree there.

      You might have a second look at the de Koning et al. number. It's an outlier in the literature. Folks in my lab tested it against negative controls, and I believe it's an overestimate of what they can detect reliably. They're certainly correct that available methods fail to detect highly diverged mobile elements though.

  9. I think it's important to show in this fig that ~40% of the genome is transcribed to coding pre-mRNAs. One of the main (misleading) arguments about pervasive transcription goes like "only 1% of the genome is coding, but most of the genome is transcribed". Many people are surprised to learn how much of the genome is covered by annotated coding pre-mRNA transcription units.

    1. I agree that it's important to say that genes make up 30% of the genome. I'm making a big deal of this in my book, especially in the chapter on pervasive transcription.

      I agree with you that the 1% figure is extremely misleading (i.e. fake news). It's 2018. Scientists and science writers should not be making such mistakes.

      We can quibble about the exact percentage due to genes. I think the annotators are making a mistake by including extra DNA at the ends of the real genes. When you look at well-characterized genes you will often find that annotators have tacked on an extra few kilobases that represent spurious initiation. That's why some estimates suggest that genes cover 40% of the genome. I think this is junk RNA and real protein-coding genes represent only 23% of the genome.

    2. Yes, I agree it's reasonable to worry about overannotated low-usage 5' and 3' ends. I don't know of a better objective number to use though - where did you get 23% from? And are you applying this no-crappy-annotation standard equally to lncRNA annotation? I'm surprised that you say there's 6% (180 Mb) in ncRNA introns; I would guess you'd have to rely on current genome-wide lncRNA annotation to get a number that high. Well-supported ncRNA gene transcripts definitely have some introns, but I wouldn't have imagined 180 Mb.

    3. Wherever possible, I try to rely on data from well-characterized genes and not on genome predictions. Here's a draft of what I've written so far ....

      "Both protein-coding genes and noncoding genes can have introns. A typical protein-coding gene in humans has 8 or 9 exons and 7 or 8 introns. The number ranges from zero to 30 but the vast majority of protein-coding genes have fewer than 10 exons. In contrast, those genes that produce noncoding RNAs usually don’t have introns and those that do usually have only one intron (two exons) (Harrow et al., 2012).

      The average number of introns in a human protein-coding gene is 7.7 and the average length of introns is 4.66 kb (Lynch, 2007 p. 49). The average exon is 0.15 kb (150 bp, enough to code for 50 amino acids). There are 8.7 exons so the average coding region is about 1300 bp if these numbers are accurate. That would encode a protein of about 435 amino acid residues with a molecular weight of about 54,000. That’s about right for an average protein.

      The average contribution of introns in a gene is 7.7 × 4.66 kb = 35.9 kb or 35,900 base pairs. If you add together the exons and introns, you get 37,200 base pairs. We’ll assume that the average protein-coding gene (transcribed region) is 37,200 bp or 37.2 kb.

      There are roughly 20,000 of these protein-coding genes. They would occupy 37.2 × 20,000 = 744,000 kb or 23.3% of the genome. You may have heard that genes make up only 1% or 2% of our genome but that only counts exon sequences. The total amount of coding region (exons) is 20,000 × 1300 = 26,000,000 bp (26 Mb) or 0.8% of the genome."

      The estimate for noncoding RNA genes is more complicated. We know about the well-characterized genes but we have to allow for the existence of a number of other genes (e.g. genes for lncRNAs). I want to be fairly generous in my estimate but I also want to challenge the exaggerated claims.

      Here's what I've got so far ....

      "The genes for tRNAs account for less than 0.1% of the human genome. The genes for all the other small RNAs make up less than 0.1% of the genome. There are about 300 copies of each of the ribosomal RNA genes scattered over several chromosomes in five clusters of about 60 genes each (Stults et al., 2008). This accounts for about 0.4% of the genome. The total for all of these well-characterized non-protein-coding genes is no more than 0.6% of your genome.

      The main controversy over the number of genes is over how to count those parts of the genome that are transcribed to produce RNA (potential genes) but where there’s no known function for those RNAs. The latest estimate from the Ensembl website (July 2015) lists an additional 20,000 such “genes.” Most of them are bits of DNA complementary to a special type of noncoding RNA called “long noncoding RNA” (lncRNA). Note that the Ensembl annotators are using a different definition of a gene than the one I’m using. They don’t really care if the RNA product has a function or not so they describe any piece of DNA that’s transcribed as a “gene.” That’s not going to work because the correct definition of a gene requires that it produce a functional product. Otherwise it’s not a gene—although it may be a potential gene.

      For now, let’s assume there are about 5,000 noncoding RNA genes in total. Many of them have large introns. These additional genes may cover about 6.4% of the genome if they contain lots of large introns. (This is a generous estimate.) Adding up noncoding and coding genes accounts for roughly 30% of the genome. The functional parts of these extra noncoding genes might only cover about 0.4% of the genome."

    4. "low-usage 5' and 3' ends": what does this mean? OK, I know what it means, but this context seems to exclude the population idea. In many populations, e.g. cancer cell lines, these may be low usage, but is that true in living humans? In all cell types? In all genetic makeups?

      I realize I'm referring to a small percent of a small percent, but is annotation great at a population level?

  10. The exons and Numts labels look switched around.

  11. Nvm, that line between exons and introns looks like the pie slice line and Numts is so small it's probably invisible between exons and unknown.

  12. Where can we get those numbers/proportions for referencing?

  13. Hello, good summary. Where can I get the citations for the numbers on this?

    1. I read dozens and dozens of papers on genome composition in order to come up with the values in the pie chart. There are no specific citations that I can give you to back up each value. If you have a question about any one of those values I'd be happy to explain why I think it's accurate and give you multiple, and often conflicting, references to the scientific literature.

    2. Or you can wait until my book is published and check all the references that I include. :-)

  14. It seems a great article, but in order to be taken seriously it should include some citations :)

    1. See above. You can take it seriously because I've been studying this problem for thirty years. That doesn't mean I'm 100% correct about every value in the pie chart but I'm confident that they are as accurate as we could get in 2018.

      I've made some minor revisions and updates that I'll post later.

  15. First, the pie chart tells me that ≈ 9% is “unknown” (is this “junk DNA”?)

    Then, a “list of DNA sequences that are known or presumed to have a function (i.e. they are not junk)” is presented ... and concluded with “This adds up to 8% of the genome. The remaining 92% is junk.”

    How should I read this?

    How does the “list of known DNA sequences” (and the percentages shown) relate to the pie chart?

    Is it “92% junk” or “9% junk”?

  16. When the genome was decoded, Lander referred to it as a ‘parts list’. He commented upon the need for the ‘operating manual’.

    This is surely the most succinct comment made by a genomics researcher. Directly or indirectly, the genes collectively express proteins but also other biological moieties. I use the plural ‘genes’ because there are very few cases where a single gene expresses a single protein, which explains why genetic engineering is rarely 100% effective. It is not that the genomic screening is wrong (although I am astonished that such a complex technique is considered to be so); it is that elements of chemistry are being overlooked, i.e. the shape/conformation, energetics and reactivity. There is more to the genome than just its chemical structure.

    Everything has to conform to the laws of chemistry and physics.
