tag:blogger.com,1999:blog-37148773.post5519917743191145449..comments2024-03-27T14:50:47.345-04:00Comments on <center>Sandwalk</center>: What's In Your Genome? - The Pie ChartLarry Moranhttp://www.blogger.com/profile/05756598746605455848noreply@blogger.comBlogger37125tag:blogger.com,1999:blog-37148773.post-15568895360938821682023-04-06T03:53:34.629-04:002023-04-06T03:53:34.629-04:00Lander has referred when decoding the genome to it...When decoding the genome, Lander referred to it as a ‘parts list’ and commented on the need for the ‘operating manual’.<br /><br />This is surely the most succinct comment made by a genomics researcher. Directly or indirectly, the genes collectively express proteins but also other biological moieties. I use the plural ‘genes’ because there are very few cases where a single gene expresses a single protein, which explains why genetic engineering is rarely 100% effective. It is not that the genomic screening is wrong (although I am astonished that such a complex technique is considered to be so); it is that elements of chemistry, i.e. the shape/conformation, energetics and reactivity, are being overlooked. There is more to the genome than just its chemical structure.<br /><br />Everything has to conform to the laws of chemistry and physics.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-37148773.post-77047861630046284382019-06-08T22:37:31.889-04:002019-06-08T22:37:31.889-04:00First, the pie chart tells me that ≈ 9% is “unknow...First, the pie chart tells me that ≈ 9% is “unknown” (is this “junk DNA”?)<br /><br />Then, a “list of DNA sequences that are known or presumed to have a function (i.e. they are not junk)” is presented ... and concludes with “This adds up to 8% of the genome. The remaining 92% is junk.”<br /> <br />How should I read this? <br /><br />How does the “list of known DNA sequences” (and the percentages shown) relate to the pie chart? 
<br /><br />Is it “92% junk” or “9% junk”?<br />Henry Normanhttps://www.blogger.com/profile/07818971016888907427noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-26267911158192720322019-04-24T10:46:12.592-04:002019-04-24T10:46:12.592-04:00See above. You can take it seriously because I'...See above. You can take it seriously because I've been studying this problem for thirty years. That doesn't mean I'm 100% correct about every value in the pie chart but I'm confident that they are as accurate as we could get in 2018. <br /><br />I've made some minor revisions and updates that I'll post later. Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-34407137478310411492019-04-24T10:38:59.399-04:002019-04-24T10:38:59.399-04:00Or you can wait until my book is published and che...Or you can wait until my book is published and check all the references that I include. :-)Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-23907905236599884082019-04-24T10:37:24.701-04:002019-04-24T10:37:24.701-04:00I read dozens and dozens of papers on genome compo...I read dozens and dozens of papers on genome composition in order to come up with the values in the pie chart. There are no specific citations that I can give you to back up each value. If you have a question about any one of those values I'd be happy to explain why I think it's accurate and give you multiple, and often conflicting, references to the scientific literature. 
Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-72391008400831529302019-04-22T16:17:18.521-04:002019-04-22T16:17:18.521-04:00It seems a great article, but in order to be taken...It seems a great article, but in order to be taken seriously it should include some citations :)Luís TAhttps://www.blogger.com/profile/05640601147093689427noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-12444106665443273682019-03-31T16:49:49.426-04:002019-03-31T16:49:49.426-04:00Hello, good summary. Where can I get the citations...Hello, good summary. Where can I get the citations for the numbers on this?Nesslig20https://www.blogger.com/profile/00209192071601766693noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-16948748594800967362018-06-22T00:48:38.074-04:002018-06-22T00:48:38.074-04:00Brilliant summary, thank you. Brilliant summary, thank you. Anonymoushttps://www.blogger.com/profile/13713508846267597418noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-12373614446078435522018-04-05T16:58:06.608-04:002018-04-05T16:58:06.608-04:00Where can we get those numbers/proportions for ref...Where can we get those numbers/proportions for referencing?Gabohttps://www.blogger.com/profile/17552375541700079254noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-70513217437789137202018-04-01T16:50:07.301-04:002018-04-01T16:50:07.301-04:00Looks like a cheesecake to me, the type that have ...Looks like a cheesecake to me, the type that has a variety of flavors. Some flavors taste good (Numts, for example) and others not so good (defective transposons). Anonymoushttps://www.blogger.com/profile/08245622913549106351noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-53881207066060907622018-03-30T19:55:24.329-04:002018-03-30T19:55:24.329-04:00"low-usage 5' and 3' ends" what ..."low-usage 5' and 3' ends" what does this mean? 
Ok, I know what it means, but this context seems to exclude the population idea. In many populations, e.g. cancer cell lines, these may be low usage, but is that true in living humans? all cell types? genetic makeups? <br /><br />I realize I'm referring to a small percent of a small percent, but is annotation great at a population level?The Loraxhttps://www.blogger.com/profile/13361004494346338824noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-71692831902913359842018-03-30T10:57:14.175-04:002018-03-30T10:57:14.175-04:00Nvm, that line between exons and introns looks lik...Nvm, that line between exons and introns looks like the pie slice line and Numts is so small it's probably invisible between exons and unknown.Pausaniashttps://www.blogger.com/profile/03729249155189095319noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-54769002080406598192018-03-30T10:55:19.966-04:002018-03-30T10:55:19.966-04:00The exons and Numts labels look switched around.The exons and Numts labels look switched around.Pausaniashttps://www.blogger.com/profile/03729249155189095319noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-47558013726053380082018-03-29T18:57:26.795-04:002018-03-29T18:57:26.795-04:00Wherever possible, I try to rely on data from well...Wherever possible, I try to rely on data from well-characterized genes and not on genome predictions. Here's a draft of what I've written so far .... <br /><br />"Both protein-coding genes and noncoding genes can have introns. A typical protein-coding gene in humans has 6 or 7 exons and 7 or 8 introns. The number ranges from zero to 30 but the vast majority of protein-coding genes have fewer than 10 exons. In contrast, those genes that produce noncoding RNAs usually don’t have introns and those that do have only one intron (two exons) (Harrow et al., 2012). <br /><br /> The average number of introns in a human protein-coding gene is 7.7 and the average length of introns is 4.66 kb (Lynch, 2007 p. 49). 
The average exon is 0.15 kb (150 bp, enough to code for 50 amino acids). There are 8.7 exons so the average coding region is about 1300 bp if these numbers are accurate. That would encode a protein of about 435 amino acid residues with a molecular weight of about 54,000. That’s about right for an average protein.<br /><br /> The average contribution of introns in a gene is 7.7 × 4.66 kb = 35.9 kb or 35,900 base pairs. If you add together the exons and introns, you get 37,200 base pairs. We’ll assume that the average protein-coding gene (transcribed region) is 37,200 bp or 37.2 kb. <br /><br /> There are roughly 20,000 of these protein-coding genes. They would occupy 37.2 × 20,000 = 744,000 kb or 23.3% of the genome. You may have heard that genes make up only 1% or 2% of our genome but that only counts exon sequences. The total amount of coding region (exons) is 20,000 × 1300 = 26,000,000 bp (26 Mb) or 0.8% of the genome." <br /><br />The estimate for noncoding RNA genes is more complicated. We know about the well-characterized genes but we have to allow for the existence of a number of other genes (e.g. genes for lncRNAs). I want to be fairly generous in my estimate but I also want to challenge the exaggerated claims.<br /><br />Here's what I've got so far ....<br /><br />"The genes for tRNAs account for less than 0.1% of the human genome. The genes for all the other small RNAs make up less than 0.1% of the genome. There are about 300 copies of each of the ribosomal RNA genes scattered over several chromosomes in five clusters of about 60 genes each (Stults et al., 2008). This accounts for about 0.4% of the genome. The total for all of these well-characterized non-protein-coding genes is no more than 0.6% of your genome. <br /><br /> The main controversy over the number of genes is over how to count those parts of the genome that are transcribed to produce RNA (potential genes) but where there’s no known function for those RNAs. 
The latest estimate from the Ensembl website (July 2015) lists an additional 20,000 such “genes.” Most of them are bits of DNA complementary to a special type of noncoding RNA called “long noncoding RNA” (lncRNA). Note that the Ensembl annotators are using a different definition of a gene than the one I’m using. They don’t really care if the RNA product has a function or not so they describe any piece of DNA that’s transcribed as a “gene.” That’s not going to work because the correct definition of a gene requires that it produce a functional product. Otherwise it’s not a gene, although it may be a potential gene. <br /><br /> For now, let’s assume there are about 5,000 noncoding RNA genes in total. Many of them have large introns. These additional genes may cover about 6.4% of the genome if they contain lots of large introns. (This is a generous estimate.) Adding up noncoding and coding genes accounts for roughly 30% of the genome. The functional parts of these extra noncoding genes might only cover about 0.4% of the genome." <br />Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-32602758687409263152018-03-29T18:40:10.151-04:002018-03-29T18:40:10.151-04:00Yes, I agree it's reasonable to worry about ov...Yes, I agree it's reasonable to worry about overannotated low-usage 5' and 3' ends. I don't know of a better objective number to use though - where did you get 23% from? And are you applying this no-crappy-annotation standard equally to lncRNA annotation? I'm surprised that you say there's 6% (180 Mb) in ncRNA introns; I would guess you'd have to rely on current genome-wide lncRNA annotation to get a number that high. 
Well-supported ncRNA gene transcripts definitely have some introns, but I wouldn't have imagined 180 Mb.Anonymoushttps://www.blogger.com/profile/03909661403450695948noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-73911908685584763772018-03-29T18:33:39.825-04:002018-03-29T18:33:39.825-04:00Mobile element derived = transposon derived + viru...Mobile element derived = transposon derived + virus derived, so we don't disagree there.<br /><br />You might have a second look at the de Koning et al. number. It's an outlier in the literature. Folks in my lab tested it against negative controls, and I believe it's an overestimate of what they can detect reliably. They're certainly correct that available methods fail to detect highly diverged mobile elements though.Anonymoushttps://www.blogger.com/profile/03909661403450695948noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-63360806666759351202018-03-29T17:15:47.697-04:002018-03-29T17:15:47.697-04:00I agree that it's important to say that genes ...I agree that it's important to say that genes make up 30% of the genome. I'm making a big deal of this in my book, especially in the chapter on pervasive transcription. <br /><br />I agree with you that the 1% figure is extremely misleading (i.e. fake news). It's 2018. Scientists and science writers should not be making such mistakes.<br /><br />We can quibble about the exact percentage due to genes. I think the annotators are making a mistake by including extra DNA at the ends of the real genes. When you look at well-characterized genes you will often find that annotators have tacked on an extra few kilobases that represent spurious initiation. That's why some estimates suggest that genes cover 40% of the genome. I think this is junk RNA and real protein-coding genes represent only 23% of the genome. 
<br /><br />Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-5263466766092878762018-03-29T16:56:19.420-04:002018-03-29T16:56:19.420-04:00@Sean Eddy
I was impressed with the work of Platt...@Sean Eddy<br /><br />I was impressed with the work of Platt et al. (2016). They make a good case that the percentage of transposon-related sequences is consistently underestimated in mammalian genomes. <br /><br />de Koning et al. (2011) used a new algorithm that works better with short segments of repeat DNA and they estimate that 66-69% of the genome is derived from transposons. They explain why older techniques underestimate repeats. Their explanation seems credible to me. <br /><br />I think it's reasonable to assume that 60% of the human genome is derived from transposons. It's almost certainly more than 50% and the rest depends on the look-back time (sequence similarity). <br /><br />Dan Graur agrees with you that almost all junk DNA is derived from transposons (~90%). That can't be right since we know that a substantial fraction comes from integrated virus DNA (~9%). We also know that segmental duplications account for a significant fraction of excess DNA and some of that is unique-sequence DNA (e.g. coding regions and the functional part of genes for noncoding RNAs). <br /><br />We'll never know for sure what fraction of junk DNA came from transposons but I don't think it's wise to claim that it's 100%. <br /> <br /><b>de Koning, A., Gu, W., Castoe, T.A., Batzer, M. A., and Pollock, D.D. (2011)</b> Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet, 7:e1002384. [<a href="https://doi.org/10.1371/journal.pgen.1002384" rel="nofollow">doi: 10.1371/journal.pgen.1002384</a>]<br /><br /><b>Platt, R.N., Blanco-Berdugo, L., and Ray, D.A. (2016)</b> Accurate transposable element annotation is vital when analyzing new genome assemblies. Genome Biology and Evolution, 8:403-410. 
[<a href="https://doi.org/10.1093/gbe/evw009" rel="nofollow">doi: 10.1093/gbe/evw009</a>]<br /><br /> Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-54476065104971703622018-03-29T13:55:44.922-04:002018-03-29T13:55:44.922-04:00What's your source for the 65% detectably mobi...What's your source for the 65% detectably mobile element derived number? I don't think that's right. It's more like 50-55% in most people's hands, except for one paper that I think is likely an outlier with a false positive issue. <br /><br />(Mind you, I think the fraction of the genome that's derived from mobile elements is >90%, because they decay away so fast and can't be recognized; just talking "detectable" by similarity to known mobile element families.)Anonymoushttps://www.blogger.com/profile/03909661403450695948noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-26095725446539304162018-03-29T13:47:04.838-04:002018-03-29T13:47:04.838-04:00I think it's important to show in this fig tha...I think it's important to show in this fig that ~40% of the genome is transcribed to coding pre-mRNAs. One of the main (misleading) arguments about pervasive transcription goes like "only 1% of the genome is coding, but most of the genome is transcribed". Many people are surprised to learn how much of the genome is covered by annotated coding pre-mRNA transcription units.Anonymoushttps://www.blogger.com/profile/03909661403450695948noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-45953591317736129382018-03-29T12:50:48.333-04:002018-03-29T12:50:48.333-04:00Some genome annotations include the most distant 5...Some genome annotations include the most distant 5′ start sites even if the RNA starting from those sites is extremely rare. Same for termination sites. These are probably not biologically relevant alternative transcripts. 
They should be ignored.<br /><br />As a consequence, the size of most genes is inflated and so is the number of introns. I ignore the ridiculous false upstream promoters. Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-3437122374225895972018-03-29T12:37:22.539-04:002018-03-29T12:37:22.539-04:00You are correct. Several of the categories overlap...You are correct. Several of the categories overlap, making it really difficult to present the data in a meaningful way. See <a href="http://sandwalk.blogspot.ca/2011/05/whats-in-your-genome.html" rel="nofollow">What's in Your Genome?</a><br /><br />I fudged the numbers by ignoring stuff in introns that's included in other categories. More than half of the intron sequences contain defective transposons and defective viruses. Some intron sequences include noncoding genes. <br /><br />The total amount of DNA occupied by transposon fragments is closer to 65% when you allow for sequences that are more than 50 million years old so this compensates for the decision to ignore transposons in introns. <br /><br />About 10% of the genome doesn't fit into any category - it's intergenic unique sequence junk DNA. I just realized that I forgot to include that category so the numbers don't add up to 100%. I have to redo the pie chart. Larry Moranhttps://www.blogger.com/profile/05756598746605455848noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-18796160155039655942018-03-29T11:57:33.604-04:002018-03-29T11:57:33.604-04:00How do you count defective transposons in transcri...How do you count defective transposons in transcribed pre-mRNA (introns, UTRs especially)? Current human genome assembly is annotated as 40% transcribed to coding pre-mRNAs, not 20%. About 50% of that sequence is detectably derived from mobile elements. 
Does your "intron" category mean "intron sequence that isn't already counted in another category"?<br /><br />(That is, more generally: your categories aren't exclusive, so they shouldn't add up to 100%.)Anonymoushttps://www.blogger.com/profile/03909661403450695948noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-71330140710309257632018-03-29T04:53:54.234-04:002018-03-29T04:53:54.234-04:00
Hi Larry,
This is a great figure. Is it OK if ...<br />Hi Larry, <br /><br />This is a great figure. Is it OK if I use it in my undergrad lectures? Appropriate credit will be given, of course.<br /><br />SimonSnakehttps://www.blogger.com/profile/08017050832134070036noreply@blogger.comtag:blogger.com,1999:blog-37148773.post-67680203741353640322018-03-28T21:48:09.026-04:002018-03-28T21:48:09.026-04:00I recommend making the wedge for introns in protei...I recommend making the wedge for introns in protein-coding gene a little different shade, probably paler and/or greener. The pie chart is good. I don't care one way or the other about the 3-D component.Anonymousnoreply@blogger.com
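The gene-size arithmetic in Larry Moran's draft earlier in this thread (7.7 introns of 4.66 kb, 8.7 exons of 150 bp, roughly 20,000 protein-coding genes) can be reproduced with a short script. This is only a rough sketch: the per-gene averages are the ones quoted in the draft, while the genome size of 3.2 Gb is an assumption that the thread never states explicitly, and the function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the gene-size arithmetic quoted in the thread.
# All names here are hypothetical; only the per-gene averages come from the draft.

GENOME_BP = 3_200_000_000      # ASSUMPTION: ~3.2 Gb genome (not stated in the thread)

INTRONS_PER_GENE = 7.7         # average introns per protein-coding gene (draft)
INTRON_LENGTH_BP = 4_660       # average intron length, 4.66 kb (draft)
EXONS_PER_GENE = 8.7           # average exons per protein-coding gene (draft)
EXON_LENGTH_BP = 150           # average exon length, 0.15 kb (draft)
PROTEIN_CODING_GENES = 20_000  # rough gene count (draft)


def gene_footprint(genome_bp=GENOME_BP):
    """Return per-gene sizes and the genome fractions they imply."""
    exon_bp = EXONS_PER_GENE * EXON_LENGTH_BP        # about 1,300 bp of exon per gene
    intron_bp = INTRONS_PER_GENE * INTRON_LENGTH_BP  # about 35,900 bp of intron per gene
    gene_bp = exon_bp + intron_bp                    # about 37,200 bp transcribed region
    return {
        "exon_bp_per_gene": exon_bp,
        "intron_bp_per_gene": intron_bp,
        "gene_bp": gene_bp,
        # fraction of the genome covered by whole genes (exons + introns)
        "genes_fraction": PROTEIN_CODING_GENES * gene_bp / genome_bp,
        # fraction of the genome covered by exons alone
        "exons_fraction": PROTEIN_CODING_GENES * exon_bp / genome_bp,
    }


if __name__ == "__main__":
    for key, value in gene_footprint().items():
        print(f"{key}: {value:,.4f}")
```

Under that assumed genome size, whole genes come out at about 23% of the genome and exons alone at a little under 1%, which is the contrast the draft draws between "genes" and "coding sequence". Changing `genome_bp` shows how sensitive both percentages are to the genome size chosen.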