Wednesday, February 10, 2021

The 20th anniversary of the human genome sequence:
4. Functional DNA in our genome

We know a lot more about the human genome than we did when the draft sequences were published 20 years ago. One of the most important discoveries is the recognition and extent of true functional sequences in the genome. Genes are one example of such functional sequence but only a minor component (about 1.4%). Most of the functional regions of the genome are not genes.

Here's a list of functional DNA in our genome other than the functional part of genes.

  • Centromeres: There are 24 different centromeres and the average size is four million base pairs. Most of this is repetitive DNA and it adds up to about 3% of the genome. The total amount of centromeric DNA ranges from 2%-10% in different individuals. It's unlikely that all of the centromeric DNA is essential; about 1% seems to be a good estimate.
  • Telomeres: Telomeres are repetitive DNA sequences at the ends of chromosomes. They are required for the proper replication of DNA and they take up about 0.1% of the genome sequence.
  • Origins of replication: DNA replication begins at origins of replication. The size of each origin has not been established with certainty but it's safe to assume that 100 bp is a good estimate. There are about 100,000 origin sequences but it's unlikely that all of them are functional or necessary. It's reasonable to assume that only 30,000 - 50,000 are real origins and that means 0.3% of the genome is devoted to origins of replication.
  • Regulatory sequences: The transcription of every gene is controlled by sequences that lie outside of the genes, usually at the 5′ end. The total amount of regulatory sequence is controversial but it seems reasonable to assume about 200 bp per gene for a total of five million bp or less than 0.2% of the genome (0.16%). The most extreme claim is about 2,400 bp per gene or 1.8% of the genome.
  • Scaffold attachment regions (SARs): Human chromatin is organized into about 100,000 large loops. The base of each loop consists of particular proteins bound to specific sequences called anchor loop sequences. The nomenclature is confusing; the original term (SAR) isn't as popular today as it was 35 years ago but that doesn't change the fact that about 0.3% of the genome is required to organize chromatin.
  • Transposons: Most of the transposon-related sequences in our genome are just fragments of defective transposons but there are a few active ones. They account for only a tiny fraction of the genome.
  • Viruses: Functional virus DNA sequences account for less than 0.1% of the genome.

If you add up all the functional DNA from this list, you get to somewhere between 2% and 3% of the genome.
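Here's a rough sketch of that arithmetic in Python. It takes the percentages in the list at face value and makes a few assumptions that are mine, not part of the estimates themselves: a genome size of 3.2 Gb, the "tiny fraction" for transposons treated as zero, and the "less than 0.1%" for viruses treated as an upper bound of 0.1%. The gene count of about 25,000 is simply what you get by dividing five million bp by 200 bp per gene.

```python
# Back-of-the-envelope tally of the estimates in the list above.
GENOME_BP = 3.2e9  # assumed genome size (~3.2 Gb)

# Regulatory sequences: 200 bp per gene and five million bp in total,
# which implies roughly 25,000 genes (an inferred figure).
genes = 5_000_000 / 200
regulatory_pct = 100 * 5_000_000 / GENOME_BP

# Percent-of-genome estimates as stated in the list.
functional_pct = {
    "centromeres": 1.0,
    "telomeres": 0.1,
    "origins of replication": 0.3,
    "regulatory sequences": round(regulatory_pct, 2),
    "scaffold attachment regions": 0.3,
    "transposons": 0.0,  # "tiny fraction", treated as ~0 here (assumption)
    "viruses": 0.1,      # stated as "less than 0.1%", used as an upper bound
}

total_pct = sum(functional_pct.values())

print(f"genes implied by the regulatory estimate: {genes:,.0f}")
print(f"regulatory sequences: {regulatory_pct:.2f}% of the genome")
print(f"total functional DNA outside of genes: {total_pct:.2f}%")
```

With the conservative values the total lands near the low end of the 2% to 3% range. Swapping in the more generous numbers (for example, 1.8% for regulatory sequences) pushes it higher.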


Image credit: Wikipedia.

Monday, February 08, 2021

The 20th anniversary of the human genome sequence: 3. How many genes?

This week marks the 20th anniversary of the publication of the first drafts of the human genome sequence. Science chose to celebrate the achievement with a series of articles that had little to say about the scientific discoveries arising out of the sequencing project; one of the articles praised the openness of sequence data without mentioning that the journal had violated its own policy on openness by publishing the Celera sequence [The 20th anniversary of the human genome sequence: 1. Access to the data and the complicity of Science].

I've decided to post a few articles about the human genome beginning with one on finishing the sequence. In this post I'll summarize the latest data on the number of genes in the human genome.

Saturday, February 06, 2021

The 20th anniversary of the human genome sequence:
2. Finishing the sequence

It's been 20 years since the first drafts of the human genome sequence were published. These first drafts from the International Human Genome Project (IHGP) and Celera were far from complete. The IHGP sequence covered about 82% of the genome and it contained about 250,000 gaps and millions of sequencing errors.

Celera never published an updated sequence but IHGP published a "finished" sequence in October 2004. It covered about 92% of the genome and had "only" 300 gaps. The error rate of the finished sequence was down to 10⁻⁵.

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945. doi: 10.1038/nature03001

We've known for many decades that the correct size of the human genome is close to 3,200,000 kb or 3.2 Gb. There isn't a more precise number because different individuals have different amounts of DNA. The best average estimate was 3,286 Mb based on the sequence of 22 autosomes, one X chromosome, and one Y chromosome (Morton 1991). The amount of actual nucleotide sequence in the latest version of the reference genome (GRCh38.p13) is 3,110,748,599 bp and the estimated total size is 3,272,116,950 bp based on estimating the size of the remaining gaps. This means that 95% of the genome has been sequenced. [see How much of the human genome has been sequenced? for a discussion of what's missing.]
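The 95% figure is easy to check. Here's a minimal sketch in Python that uses only the two GRCh38.p13 numbers quoted above; the variable names are just for illustration.

```python
# Numbers quoted above for the GRCh38.p13 reference genome.
sequenced_bp = 3_110_748_599        # actual nucleotide sequence
estimated_total_bp = 3_272_116_950  # sequence plus estimated gap sizes

# What's still missing, and what fraction has been sequenced.
gap_bp = estimated_total_bp - sequenced_bp
coverage_pct = 100 * sequenced_bp / estimated_total_bp

print(f"estimated gaps: {gap_bp:,} bp")
print(f"sequenced: {coverage_pct:.1f}% of the genome")
```

The remaining gaps add up to roughly 161 million bp, which is why the sequenced fraction rounds to 95%.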

Recent advances in sequencing technology have produced sequence data covering the repetitive regions in the gaps and the first complete sequence of a human chromosome (X) was published in 2019 [First complete sequence of a human chromosome]. It's now possible to complete the human genome reference sequence by sequencing at least one individual but I'm not sure that the effort and the expense are worth it.


Image credit: the figure is from Miga et al. (2019).

Miga, K.H., Koren, S., Rhie, A., Vollger, M.R., Gershman, A., Bzikadze, A., Brooks, S., Howe, E., Porubsky, D., Logsdon, G.A. et al. (2019) Telomere-to-telomere assembly of a complete human X chromosome. Nature 585:79-84. [doi: 10.1038/s41586-020-2547-7]

Morton, N.E. (1991) Parameters of the human genome. Proceedings of the National Academy of Sciences 88:7474-7476. [doi: 10.1073/pnas.88.17.7474]

The 20th anniversary of the human genome sequence: 1. Access to the data and the complicity of Science

The first drafts of the human genome sequence were published 20 years ago. The paper from the International Human Genome Project (IHGP) was published in Nature on February 15, 2001 and the paper from Celera was published in Science on February 16, 2001.

The original agreement was to publish both papers in Science but IHGP refused to publish their sequence in that journal when it chose to violate its own policy by allowing Celera to restrict access to its data. I highly recommend James Shreeve's book The Genome War for the history behind these publications. It paints an accurate, but not pretty, picture of science and politics.

Lander, E., Linton, L., Birren, B., Nusbaum, C., Zody, M., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A. and Sougnez, C. (2001) Initial sequencing and analysis of the human genome. Nature 409:860-921. doi: 10.1038/35057062

Venter, J., Adams, M., Myers, E., Li, P., Mural, R., Sutton, G., Smith, H., Yandell, M., Evans, C., Holt, R., Gocayne, J., Amanatides, P., Ballew, R., Huson, D., Wortman, J., Zhang, Q., Kodira, C., Zheng, X., Chen, L., Skupski, M., Subramanian, G., Thomas, P., Zhang, J., Gabor Miklos, G., Nelson, C., Broder, S., Clark, A., Nadeau, J., McKusick, V. and Zinder, N. (2001) The sequence of the human genome. Science 291:1304 - 1351. doi: 10.1126/science.1058040

Thursday, December 31, 2020

On the importance of controls

When doing an experiment, it's important to keep the number of variables to a minimum and it's important to have scientific controls. There are two types of controls. A negative control covers the possibility that you will get a signal by chance; for example, if you are testing an enzyme to see whether it degrades sugar then the negative control will be a tube with no enzyme. Some of the sugar may degrade spontaneously and you need to know this. A positive control is when you deliberately add something that you know will give a positive result; for example, if you are doing a test to see if your sample contains protein then you want to add an extra sample that contains a known amount of protein to make sure all your reagents are working.

Lots of controls are more complicated than the examples I gave but the principle is important. It's true that some experiments don't appear to need the appropriate controls but that may be an illusion. The controls might still be necessary in order to properly interpret the results but they're not done because they are very difficult. This is often true of genomics experiments.

Saturday, December 19, 2020

What do believers in epigenetics think about junk DNA?

I've been writing some stuff about epigenetics so I've been reading papers on how to define the term [What the heck is epigenetics?]. Turns out there's no universal definition but I discovered that scientists who write about epigenetics are passionate believers in epigenetics no matter how you define it. Surprisingly (not!), there seems to be a correlation between belief in epigenetics and other misconceptions such as the classic misunderstanding of the Central Dogma of Molecular Biology and rejection of junk DNA [The Extraordinary Human Epigenome].

Here's an illustration of this correlation from the introduction to a special issue on epigenetics in Philosophical Transactions B.

Ganesan, A. (2018) Epigenetics: the first 25 centuries, Philosophical Transactions B. 373: 20170067. [doi: 10.1098/rstb.2017.0067]

Epigenetics is a natural progression of genetics as it aims to understand how genes and other heritable elements are regulated in eukaryotic organisms. The history of epigenetics is briefly reviewed, together with the key issues in the field today. This themed issue brings together a diverse collection of interdisciplinary reviews and research articles that showcase the tremendous recent advances in epigenetic chemical biology and translational research into epigenetic drug discovery.

In addition to the misconceptions, the text (see below) emphasizes the heritable nature of epigenetic phenomena. This idea of heritability seems to be a dominant theme among epigenetic believers.

A central dogma became popular in biology that equates life with the sequence DNA → RNA → protein. While the central dogma is fundamentally correct, it is a reductionist statement and clearly there are additional layers of subtlety in ‘how’ it is accomplished. Not surprisingly, the answers have turned out to be far more complex than originally imagined, and we are discovering that the phenotypic diversity of life on Earth is mirrored by an equal diversity of hereditary processes at the molecular level. This lies at the heart of modern day epigenetics, which is classically defined as the study of heritable changes in phenotype that occur without an underlying change in genome sequence. The central dogma's focus on genes obscures the fact that much of the genome does not code for genes and indeed such regions were derogatively lumped together as ‘junk DNA’. In fact, these non-coding regions increase in proportion as we climb up the evolutionary tree and clearly play a critical role in defining what makes us human compared with other species.

At the risk of beating a dead horse, I'd like to point out that the author is wrong about the Central Dogma and wrong about junk DNA. He's right about the heritability of some epigenetic phenomena such as methylation of DNA but that fact has been known for almost five decades and so far it hasn't caused a noticeable paradigm shift, unless I missed it [Restriction, Modification, and Epigenetics].


Saturday, December 05, 2020

Mouse traps Michael Denton

Michael Denton is a New Zealand biochemist, a Senior Fellow at the Discovery Institute, and the author of two Intelligent Design Creationist books: Evolution: A Theory in Crisis (1985) and Nature's Destiny (1998).

He has just read Michael Behe's latest book and he (Denton) is impressed [Praise for Behe’s Latest: “Facts Before Theory”]:

Behe brings out more forcibly than any other author I have recently read just how vacuous and biased are the criticisms of his work and of the ID position in general by so many mainstream academic defenders of Darwinism. And what is so telling about his many wonderfully crafted responses to his Darwinian critics is that it is Behe who is putting the facts before theory while his many detractors — Kenneth Miller, Jerry Coyne, Larry Moran, Richard Lenski, and others — are putting theory before the facts. In short, this volume shows that it is Behe rather than his detractors who is carefully following the evidence.

I don't know what planet Michael Denton is living on—probably the same one as Michael Behe—but let's make one thing clear about facts and evidence. Behe's entire argument is based on the "fact" that he can't see how Darwin's theory of natural selection can account for the evolution of complex features: therefore god(s) must have done it. This is NOT putting facts before theory and it is NOT carefully following the evidence.

It's just a somewhat sophisticated version of god of the gaps based on Behe's lack of understanding of the basic mechanisms of evolution.

(See, Of mice and Michael, where I explain why Michael Behe fails to answer my critique of The Edge of Evolution.)


Tuesday, December 01, 2020

Of mice and Michael

Michael Behe has published a book containing most of his previously published responses to critics. I was anxious to see how he dealt with my criticisms of The Edge of Evolution but I was disappointed to see that, for the most part, he has just copied excerpts from his 2014 blog posts (pp. 335-355).

I think it might be worthwhile to review the main issues so you can see for yourself whether Michael Behe really answered his critics as the title of his most recent book claims. You can check out the dueling blog posts at the end of this summary to see how the discussion evolved in real time more than four years ago.

Many Sandwalk readers participated in the debate back then and some of them are quoted in Behe's book although he usually just identifies them as commentators.

My Summary

Michael Behe has correctly identified an extremely improbable evolutionary event; namely, the development of chloroquine resistance in the malaria parasite. This is an event that is close to the edge of evolution, meaning that more complex events of this type are beyond the edge of evolution and cannot occur naturally. However, several of us have pointed out that his explanation of how that event occurred is incorrect. This is important because he relies on his flawed interpretation of chloroquine resistance to postulate that many observed events in evolution could not possibly have occurred by natural means. Therefore, god(s) must have created them.

In his response to this criticism, he completely misses the point and fails to understand that what is being challenged is his misinterpretation of the mechanisms of evolution and his understanding of mutations.


The main point of The Edge of Evolution is that many of the beneficial features we see could only have evolved by selecting for a number of different mutations where none of the individual mutations confer a benefit by themselves. Behe claims that these mutations had to occur simultaneously or at least close together in time. He argues that this is possible in some cases but in most cases the (relatively) simultaneous occurrence of multiple mutations is beyond the edge of evolution. The only explanation for the creation of these beneficial features is god(s).

Tuesday, November 17, 2020

Using modified nucleotides to make mRNA vaccines

The key features of the mRNA vaccines are the use of modified nucleotides in their synthesis and the use of lipid nanoparticles to deliver them to cells. The main difference between the Pfizer/BioNTech vaccine and the Moderna vaccine is in the delivery system. The lipid vesicles used by Moderna are somewhat more stable and the vaccine doesn't need to be kept constantly at ultra-low temperatures.

Both vaccines use modified RNAs. They synthesize the RNA using modified nucleotides based on variants of uridine and cytidine; namely, pseudouridine, N1-methylpseudouridine, and 5-methylcytidine. (The structures of the nucleosides are from Andries et al., 2015.) The best versions are those that use both 5-methylcytidine and N1-methylpseudouridine.

I'm not an expert on these mRNAs and their delivery systems but the way I understand it is that regular RNA is antigenic: it induces antibodies against it, presumably when it is accidentally released from the lipid vesicles outside of the cell. The modified versions are much less antigenic. As an added bonus, the modified RNA is more stable and more efficiently translated.

Two of the key papers are ...

Andries, O., Mc Cafferty, S., De Smedt, S.C., Weiss, R., Sanders, N.N. and Kitada, T. (2015) "N1-methylpseudouridine-incorporated mRNA outperforms pseudouridine-incorporated mRNA by providing enhanced protein expression and reduced immunogenicity in mammalian cell lines and mice." Journal of Controlled Release 217: 337-344. [doi: 10.1016/j.jconrel.2015.08.051]

Pardi, N., Tuyishime, S., Muramatsu, H., Kariko, K., Mui, B.L., Tam, Y.K., Madden, T.D., Hope, M.J. and Weissman, D. (2015) "Expression kinetics of nucleoside-modified mRNA delivered in lipid nanoparticles to mice by various routes." Journal of Controlled Release 217: 345-351. [doi: 10.1016/j.jconrel.2015.08.007]


Sunday, November 15, 2020

Why is the Central Dogma so hard to understand?

The Central Dogma of molecular biology states ...

... once (sequential) information has passed into protein it cannot get out again (F.H.C. Crick, 1958).

The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that such information cannot be transferred from protein to either protein or nucleic acid (F.H.C. Crick, 1970).

This is not difficult to understand since Francis Crick made it very clear in his original 1958 paper and again in his 1970 paper in Nature [see Basic Concepts: The Central Dogma of Molecular Biology]. There's nothing particularly complicated about the Central Dogma. It merely states the obvious fact that sequence information can flow from nucleic acid to protein but not the other way around.
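To make the point concrete, here's a small sketch in Python of the transfers Crick classified in his 1970 paper. The set names and the function name are mine, purely for illustration. The only transfers the Central Dogma rules out are the three that start from protein.

```python
# Crick's (1970) classification of residue-by-residue information transfers.
GENERAL = {("DNA", "DNA"), ("DNA", "RNA"), ("RNA", "protein")}
SPECIAL = {("RNA", "RNA"), ("RNA", "DNA"), ("DNA", "protein")}

# The Central Dogma: sequence information cannot be transferred
# from protein to either protein or nucleic acid.
FORBIDDEN = {("protein", "protein"), ("protein", "RNA"), ("protein", "DNA")}


def violates_central_dogma(source: str, target: str) -> bool:
    """Return True only for transfers that start from protein."""
    return (source, target) in FORBIDDEN


# Transcribing a tRNA or rRNA gene is a DNA -> RNA transfer.
print(violates_central_dogma("DNA", "RNA"))      # False
# Reverse transcription is RNA -> DNA, a special transfer.
print(violates_central_dogma("RNA", "DNA"))      # False
# Back-translation from protein would be a violation.
print(violates_central_dogma("protein", "RNA"))  # True
```

Nothing in this scheme says that every gene has to encode a protein, so genes for noncoding RNAs are perfectly compatible with it.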

So, why do so many scientists have trouble grasping this simple idea? Why do they continue to misinterpret the Central Dogma while quoting Crick? It seems obvious that they haven't read the paper(s) they are referencing.

I just came across another example of such ignorance and it is so outrageous that I just can't help sharing it with you. Here are a few sentences from a recent review in the 2020 issue of Annual Review of Genomics and Human Genetics (Zerbino et al., 2020).

Once the role of DNA was proven, genes became physical components. Protein-coding genes could be characterized by the genetic code, which was determined in 1965, and could thus be defined by the open reading frames (ORFs). However, exceptions to Francis Crick's central dogma of genes as blueprints for protein synthesis (Crick, 1958) were already being uncovered: first tRNA and rRNA and then a broad variety of noncoding RNAs.

I can't imagine what the authors were thinking when they wrote this. If the Central Dogma actually said that the only role for genes was to make proteins then surely the discovery of tRNA and rRNA would have refuted the Central Dogma and relegated it to the dustbin of history. So why bother even mentioning it in 2020?


Crick, F.H.C. (1958) On protein synthesis. Symp. Soc. Exp. Biol. XII:138-163. [PDF]

Crick, F. (1970) Central Dogma of Molecular Biology. Nature 227, 561-563. [PDF file]

Zerbino, D.R., Frankish, A. and Flicek, P. (2020) "Progress, Challenges, and Surprises in Annotating the Human Genome." Annual review of genomics and human genetics 21:55-79. [doi: 10.1146/annurev-genom-121119-083418]

Wednesday, November 11, 2020

On the misrepresentation of facts about lncRNAs

I've been complaining for years about how opponents of junk DNA misrepresent and distort the scientific literature. The same complaints apply to the misrepresentation of data on alternative splicing and on the prevalence of noncoding genes. Sometimes the misrepresentation is subtle so you hardly notice it.

I'm going to illustrate subtle misrepresentation by quoting a recent commentary on lncRNAs that's just been published in BioEssays. The main part of the essay deals with ways of determining the function of lncRNAs with an emphasis on the structures of RNA and RNA-protein complexes. The authors don't make any specific claims about the number of functional RNAs in humans but it's clear from the context that they think this number is very large.

Wednesday, October 07, 2020

Undergraduate education in biology: no vision, no change

I was looking at the Vision and Change document the other day and it made me realize that very little has changed in undergraduate education. I really shouldn't be surprised since I reached the same conclusion in 2015—six years after the recommendations were published [Vision and Change] [Why can't we teach properly?].

The main recommendations of Vision and Change are that undergraduate education should adopt the proven methods of student-centered education and should focus on core concepts rather than memorization of facts. Although there has been some progress, it's safe to say that neither of these goals has been achieved in the vast majority of biology classes, including biochemistry and molecular biology classes.

Things are getting even worse in this time of COVID-19 because more and more classes are being taught online and there seems to be general agreement that this is okay. It is not okay. Online didactic lectures go against everything in the Vision and Change document. It may be possible to develop online courses that practice student-centered, concept teaching that emphasizes critical thinking but I've seen very few attempts.

Here are a couple of quotations from Vision and Change that should stimulate your thinking.

Traditionally, introductory biology [and biochemistry] courses have been offered as three lectures a week, with, perhaps, an accompanying two- or three-hour laboratory. This approach relies on lectures and a textbook to convey knowledge to the student and then tests the student's acquisition of that knowledge with midterm and final exams. Although many traditional biology courses include laboratories to provide students with hands-on experiences, too often these "experiences" are not much more than guided exercises in which finding the right answer is stressed while providing students with explicit instructions telling them what to do and when to do it.
"Appreciating the scientific process can be even more important than knowing scientific facts. People often encounter claims that something is scientifically known. If they understand how science generates and assesses evidence bearing on these claims, they possess analytical methods and critical thinking skills that are relevant to a wide variety of facts and concepts and can be used in a wide variety of contexts."

National Science Foundation, Science and Technology Indicators, 2008

If you are a student and this sounds like your courses, then you should demand better. If you are an instructor and this sounds like one of your courses then you should be ashamed; get some vision and change [The Student-Centered Classroom].

Although the definition of student-centered learning may vary from professor to professor, faculty generally agree that student-centered classrooms tend to be interactive, inquiry-driven, cooperative, collaborative, and relevant. Three critical components are consistent throughout the literature, providing guidelines that faculty can apply when developing a course. Student-centered courses and curricula take into account student knowledge and experience at the start of a course and articulate clear learning outcomes in shaping instructional design. Then they provide opportunities for students to examine and discuss their understanding of the concepts presented, offering frequent and varied feedback as part of the learning process. As a result, student-centered science classrooms and assignments typically involve high levels of student-student and student-faculty interaction; connect the course subject matter to topics students find relevant; minimize didactic presentations; reflect diverse views of scientific inquiry, including data presentation, argumentation, and peer review; provide ongoing feedback to both the student and professor about the student's learning progress; and explicitly address learning how to learn.

This is a critical time for science education since science is under attack all over the world. We need to make sure that university students are prepared to deal with scientific claims and counter-claims for the rest of their lives after they leave university. This means that they have to be skilled at critical thinking and that's a skill that can only be taught in a student-centered classroom where students can practice argumentation and learn the importance of evidence. Memorizing the enzymes of the Krebs Cycle will not help them understand climate change or why they should wear a mask in the middle of a pandemic.


Saturday, October 03, 2020

On the importance of random genetic drift in modern evolutionary theory

The latest issue of New Scientist has a number of articles on evolution. All of them are focused on extending and improving the current theory of evolution, which is described as Darwin's version of natural selection [New Scientist doesn't understand modern evolutionary theory].

Most of the criticisms come from a group who want to extend the evolutionary synthesis (EES proponents). Their main goal is to advertise mechanisms that are presumed to enhance adaptation but that weren't explicitly included in the Modern Synthesis that was put together in the late 1940s.

One of the articles addresses random genetic drift [see Survival of the ... luckiest]. The emphasis in this short article is on the effects of drift in small populations and it gives examples of reduced genetic diversity in small populations.

Wednesday, September 30, 2020

New Scientist doesn't understand modern evolutionary theory

New Scientist has devoted much of their September 26th issue to evolution, but not in a good way. Their emphasis is on 13 ways that we must rethink evolution. Readers of this blog are familiar with this theme because New Scientist is talking about the Extended Evolutionary Synthesis (EES)—a series of critiques of the Modern Synthesis in an attempt to overthrow or extend it [The Extended Evolutionary Synthesis - papers from the Royal Society meeting].

My main criticism of EES is that its proponents demonstrate a remarkable lack of understanding of modern evolutionary theory and they direct most of their attacks against the old adaptationist version of the Modern Synthesis that was popular in the 1950s. For the most part, EES proponents missed the revolution in evolutionary theory that occurred in the late 1960s with the development of Neutral Theory, Nearly-Neutral Theory, and the importance of random genetic drift. EES proponents have shown time and time again that they have not bothered to read a modern textbook on population genetics.

Tuesday, September 22, 2020

The Function Wars Part VIII: Selected effect function and de novo genes

Discussions about the meaning of the word "function" have been going on for many decades, especially among philosophers who love that sort of thing. The debate intensified following the ENCODE publicity hype disaster in 2012 where ENCODE researchers used the word function in an entirely inappropriate manner in order to prove that there was no junk in our genome. Since then, a cottage industry based on discussing the meaning of function has grown up in the scientific literature and dozens of papers have been published. This may have enhanced a lot of CVs but none of these papers has proposed a rigorous definition of function that we can rely on to distinguish functional DNA from junk DNA.

The world is not inhabited exclusively by fools and when a subject arouses intense interest and debate, as this one has, something other than semantics is usually at stake.
Stephen Jay Gould (1982)

That doesn't mean that all of the papers have been completely useless. The net result has been to focus attention on the one reliable definition of function that most biologists can accept: the selected effect function. The selected effect function is defined as ...