
Showing posts with label Genes. Show all posts

Sunday, November 01, 2015

3,000 new genes discovered in the human genome - dark matter revealed

Let's look at a recent paper published by a large group of medical researchers at the University of California, Los Angeles (USA). The paper was published online a few days ago (Oct. 26, 2015) in Nature Immunology.

The authors claim to have discovered 3,000 previously unknown genes in the human genome.

The complete reference is ...

Friday, October 16, 2015

Human mutation rates

I was excited when I saw the cover of the Sept. 25th (2015) issue of Science because I'm very interested in human mutation rates. I figured there would have to be an article that discussed current views on the number of new mutations per generation even though I was certain that the focus would be on the medical relevance of mutations. I was right. There was one article that discussed germline mutations and the overall mutation rate.

The article by Shendure and Akey (2015) is the only one that addresses human mutation rates in any meaningful way. They begin their review with ...
Despite the exquisite molecular mechanisms that have evolved to replicate and repair DNA with high fidelity, mutations happen. Each human is estimated to carry on average ~60 de novo point mutations (with considerable variability among individuals) that arose in the germline of their parents (1–4). Consequently, across all seven billion humans, about 10^11 germline mutations—well in excess of the number of nucleotides in the human genome—occurred in just the last generation (5). Furthermore, the number of somatic mutations that arise during development and throughout the lifetime of each individual human is potentially staggering, with proliferative tissues such as the intestinal epithelium expected to harbor a mutation at nearly every genomic site in at least one cell by the time an individual reaches the age of 60 (6).
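The numbers in that quote are easy to check with back-of-envelope arithmetic. This is just a sketch: the seven billion population and ~60 mutations per person come from the quote, while the ~3.2 billion base pair genome size is a standard figure not stated in the excerpt.

```python
# Back-of-envelope check of the germline mutation numbers quoted above.
population = 7e9           # humans alive (from the quote)
de_novo_per_person = 60    # new point mutations per person (from the quote)
genome_size = 3.2e9        # haploid human genome in base pairs (standard figure)

total_new_mutations = population * de_novo_per_person
print(f"{total_new_mutations:.1e}")       # 4.2e+11, i.e. on the order of 10^11
print(total_new_mutations > genome_size)  # True: more new mutations last
                                          # generation than genome positions
```

So, as the authors say, every position in the genome was mutated many times over in the last generation alone, somewhere in the human population.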

Sunday, October 04, 2015

Genetic variation in human populations

The Human Genome Project produced a high quality reference genome that serves as a standard to measure genetic variation. Every new human genome that's sequenced can be compared with the reference genome to detect differences due to mutation. It's possible to build large databases of genetic variation by sequencing genomes from different populations. Genetic variation can be used to infer evolutionary history and to test theories of population genetics. Detailed maps of genetic variation can also be used to infer selection (genetic sweeps) and distinguish it from random genetic drift.

In addition to this basic science, the analysis of multiple human genomes can be used to map genetic disease loci through association of various haplotypes with disease. The technique is called genome wide association studies (GWAS). The same technology can be used to map other phenotypes to identify the genes responsible.

The 1000 Genomes Project Consortium has just published their latest efforts in a recent issue of Nature (Oct. 1, 2015) (The 1000 Genomes Project Consortium, 2015; Sudmant et al., 2015). They looked at the genomes of 2,504 individuals from 26 different populations in Africa, East Asia, South Asia, Europe, and the Americas.


The idea is to identify variants that are segregating in humans. Single nucleotide polymorphisms (SNPs) are difficult to identify because the error rate of sequencing is significant. When comparing a new genome sequence to the reference genome you don't know whether a single base change is due to sequencing error or a genuine variant unless you have a high quality sequence. Most of the 2,504 genome sequences are not of sufficiently high quality to be certain that the false positive rate is low but by sequencing multiple genomes it becomes feasible to identify variants that are shared by more than one individual within a population.

Recall that every human genome has about 100 new mutations so that even brothers and sisters will differ at 200 sites. The 1000 Genomes Consortium looks at the frequency of alleles in a population to determine whether the genetic variation is significant. They use a preliminary cutoff of 0.5%, which means that a variant (mutation) has to be present in 5 out of 1000 genomes in order to count as a variant that's segregating within the population. They estimate that 95% of SNPs meeting this threshold are true variants. For small insertions and deletions the accuracy is about 80%.
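Here's what the 0.5% frequency cutoff amounts to in raw counts. This is a sketch, assuming frequencies are tallied over sampled chromosomes (two per diploid individual); the post's "5 out of 1000 genomes" is the same 0.5% threshold expressed per genome.

```python
# What the 0.5% allele-frequency cutoff means for the 2,504 sequenced genomes.
import math

individuals = 2504
chromosomes = 2 * individuals   # 5,008 sampled chromosomes (diploid genomes)
cutoff = 0.005                  # 0.5% frequency threshold

min_copies = math.ceil(cutoff * chromosomes)
print(min_copies)  # 26: an allele must appear on at least ~26 of 5,008
                   # chromosomes to clear the 0.5% threshold
```

An allele seen that many times independently is very unlikely to be a recurring sequencing artifact, which is why the Consortium can claim 95% accuracy for SNPs above this cutoff.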

For variants at lower frequency, additional sequencing to a depth of >30X coverage was done and the putative variant was compared against other databases of genetic variation. The predicted accuracy of variants at 0.1% frequency is about 75%.

Given those limitations, the results of the studies are very informative. Looking at single base pair changes and small indels (insertions and deletions), the typical human genome (yours and mine) differs from the standard reference genome at about 4.5 million sites. That's about 0.14% of our genomes. Humans and chimpanzees differ by about 1.4% or ten times more.
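The percentages in that paragraph follow from simple division. A minimal sketch, assuming a ~3.2 billion base pair haploid genome (a standard figure not stated in the post):

```python
# Where the 0.14% figure comes from: variant sites divided by genome size,
# with human-chimp divergence shown for scale.
genome_size = 3.2e9          # base pairs, standard haploid figure (assumption)
human_variant_sites = 4.5e6  # typical genome vs. the reference (from the post)

human_divergence = human_variant_sites / genome_size
print(f"{human_divergence:.2%}")   # 0.14%

chimp_divergence = 0.014           # ~1.4% human-chimp divergence (from the post)
print(round(chimp_divergence / human_divergence))  # 10: chimps are about
                                                   # ten times more different
```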

SNPs and small indels account for 99.9% of variants. The others are "structural variants": large deletions, copy number variants, Alu insertions, LINE L1 insertions, other transposon insertions, mitochondrial DNA insertions (NUMTs), and inversions. The typical human genome has about 2,300 of these structural variants, of which about 1,000 are large deletions.

Most of these variants are in junk DNA regions but the typical human genome carries about 10,000-12,000 variants that affect the sequence of a protein. Many of these will be neutral and some of the ones that have a detrimental effect will be heterozygous and recessive. The average person has 24-30 variants that are associated with genetic disease. (These are known detrimental alleles. If you get your genome sequenced, you will learn that you carry about 30 harmful alleles that you can pass on to your children.)

The Consortium reports that the typical genome has variants at about 500,000 sites mapping to untranslated regions of mRNA (UTRs), insulators, enhancers, and transcription factor binding sites. I assume they are using the ENCODE data here so we need to take it with a large grain of salt. Most of these sites are not biologically relevant.

As expected, common variants are distributed in populations all over the world. These are the result of mutations that arose several hundred thousand years ago and reached significant frequencies before the present-day populations separated. However, 86% of all variants are restricted to a single continental group. These are the result of mutations that occurred after the present-day populations split.

The African populations contain more genetic variation than the Asian and European populations. Again, this is expected since the European and Asian groups split from within the African group after Africans had been evolving on that continent for thousands of years. The differences are not great—Africans differ at about 4.3 million SNPs while the typical European and Asian genomes differ at only 3.5 million SNPs.

Only a small number of loci show evidence of selective sweeps, or recent selection (adaptation). This indicates that most of the differences between local ethnic groups are not associated with adaptation. The exceptions are SLC24A5 (skin pigmentation), HERC2 (eye color), LCT (lactose tolerance), and FADS (fat metabolism).


Sudmant, P.H., Rausch, T., Gardner, E.J., Handsaker, R.E., Abyzov, A., Huddleston, J., Zhang, Y., Ye, K., Jun, G., Hsi-Yang Fritz, M., Konkel, M.K., Malhotra, A., Stutz, A.M., Shi, X., Paolo Casale, F., Chen, J., Hormozdiari, F., Dayama, G., Chen, K., Malig, M., Chaisson, M.J. P., Walter, K., Meiers, S., Kashin, S., Garrison, E., Auton, A., Lam, H.Y.K., Jasmine Mu, X., Alkan, C., Antaki, D., Bae, T., Cerveira, E., Chines, P., Chong, Z., Clarke, L., Dal, E., Ding, L., Emery, S., Fan, X., Gujral, M., Kahveci, F., Kidd, J.M., Kong, Y., Lameijer, E.-W., McCarthy, S., Flicek, P., Gibbs, R.A., Marth, G., Mason, C.E., Menelaou, A., Muzny, D.M., Nelson, B.J., Noor, A., Parrish, N.F., Pendleton, M., Quitadamo, A., Raeder, B., Schadt, E.E., Romanovitch, M., Schlattl, A., Sebra, R., Shabalin, A.A., Untergasser, A., Walker, J.A., Wang, M., Yu, F., Zhang, C., Zhang, J., Zheng-Bradley, X., Zhou, W., Zichner, T., Sebat, J., Batzer, M.A., McCarroll, S.A., The Genomes Project, C., Mills, R.E., Gerstein, M.B., Bashir, A., Stegle, O., Devine, S.E., Lee, C., Eichler, E.E., and Korbel, J.O. (2015) An integrated map of structural variation in 2,504 human genomes. Nature, 526(7571), 75-81. [doi: 10.1038/nature15394]

The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature, 526(7571), 68-74. [doi: 10.1038/nature15393]

Thursday, October 01, 2015

How many RNA molecules per cell are needed for function?

One of the issues in the junk DNA wars is the importance of all those RNAs that are detected in sensitive assays. About 90% of the human genome is complementary to RNAs that are made at some time in some tissue or other. Does this pervasive transcription mean that most of the genome is functional or are most of these transcripts just background noise due to accidental transcription?

Sunday, September 06, 2015

Constructive Neutral Evolution (CNE)

Constructive Neutral Evolution (CNE) is a term that describes the evolution of complex systems by non-adaptive mechanisms. The idea (and the name) was developed by Arlin Stoltzfus in 1999 (Stoltzfus, 1999) but it has antecedents in the literature and in the environment where Stoltzfus did his post-doc (Michael Gray and Ford Doolittle). It has been promoted by a number of prominent evolutionary biologists/population geneticists, notably Michael Lynch in his book The Origins of Genome Architecture. Several examples have been described and discussed in the scientific literature and in popular books. For example, there is good reason to think that the complex spliceosome that removes introns evolved mainly by non-adaptive mechanisms.

Ford Doolittle and Michael Gray are fans of constructive neutral evolution. They and their collaborators wrote a review of the idea in Science (Gray et al., 2010). It has the provocative title "Irremediable Complexity." The same authors (different order) published another review the following year (Lukeš et al., 2011).

It's important to understand this concept because it challenges the idea that the evolution of complexity is adaptive and it sets the stage for challenging the idea that all adaptive structures arose exclusively by natural selection. Almost everyone who writes about constructive neutral evolution understands that it poses a problem for those who cling to adaptationist or selectionist views of evolution. It also helps us understand why the core idea behind irreducible complexity has been refuted.

Wednesday, August 26, 2015

Eukaryotic genes come from alphaproteobacteria, cyanobacteria, and two groups of Archaea

Bill Martin and a group of collaborators from several countries have analyzed gene trees from a wide variety of species (Ku et al., 2015). They looked at the phylogenies of 2500 different genes with representatives in both prokaryotes and eukaryotes.

The goal of this massive project was to find out if you could construct reliable consensus trees of prokaryotes and eukaryotes given that lateral gene transfer (LGT)1 is so common.

The results show that LGT is very common in prokaryotes making it quite difficult to identify the evolutionary history of prokaryotic groups based on just a small number of gene trees.

In contrast, eukaryotes appear to be a monophyletic group where all living eukaryotes are descendants of a single ancestral species. There's very little LGT in eukaryotic lineages apart from one major event in algae and plants (see below).

The genes currently found in eukaryotic genomes show that eukaryotes arose from an endosymbiotic event where a primitive alphaproteobacterium fused with a primitive archaebacterium. Remnants of the alphaproteobacterial genome are still present in mitochondria but the majority of the bacterial genes have merged with archaebacterial genes in the nuclear chromosomes. Thus, eukaryotes are hybrids formed from two distantly related prokaryotic species.

A second round of new genes was acquired in eukaryotes when a primitive single-cell species merged with a species of cyanobacterium. The remnant of the cyanobacterial genome is found in chloroplasts but, as with the alphaproteobacteria, the majority of the cyanobacterial genes merged with other genes in the nuclear genome.

The exact number of trees was 2,585. Among those trees, 49% of eukaryotic genes cluster with proteobacteria, 38% derive from cyanobacterial ancestors, and only 13% come from the archaebacterial ancestor. Thus, it's fair to say that the dominant ancestor of eukaryotes, in terms of genetic contribution, is bacterial, not archaeal.
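Those percentages translate into approximate tree counts. The paper reports the percentages; the counts below are just the arithmetic they imply.

```python
# Approximate number of gene trees behind each percentage quoted above.
total_trees = 2585
fractions = {
    "proteobacterial": 0.49,
    "cyanobacterial": 0.38,
    "archaebacterial": 0.13,
}

for source, frac in fractions.items():
    print(f"{source}: ~{round(frac * total_trees)} trees")

# The two bacterial contributions together account for 87% of eukaryotic genes.
print(f"bacterial total: {(0.49 + 0.38) * 100:.0f}%")
```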

One of the authors on the paper is James O. McInerney of the National University of Ireland, in Maynooth, County Kildare, Ireland. He made a short video that explains the result.2



1. Also known as horizontal gene transfer (HGT).

2. I hate to contaminate a scientific post by referring to creationists but I can't help but wonder how they explain this data. I'd love it if some Intelligent Design Creationist could describe how this fits in with their understanding of the history of life.

Ku, C., Nelson-Sathi, S., Roettger, M., Sousa, F.L., Lockhart, P.J., Bryant, D., Hazkani-Covo, E., McInerney, J.O., Landan, G., Martin, W.F. (2015) Endosymbiotic origin and differential loss of eukaryotic genes. Nature Published online Aug. 19, 2015 [doi: 10.1038/nature14963]

Sunday, August 23, 2015

How do Intelligent Design Creationists deal with pseudogenes and false claims?

Some of the people who comment here have pointed out that this is the second anniversary of a post by Jonathan McLatchie on Evolution News & Views (sic): A Simple Proposed Model For Function of the Human Vitamin C GULO Pseudogene.

That post is significant for several reasons. Let's review a bit of background.

Intelligent Design Creationists have a problem with pseudogenes. Recall that pseudogenes are stretches of DNA that resemble a gene but they appear to be non-functional because they have acquired disruptive mutations, or because they were never functional to begin with (e.g. processed pseudogenes). All genomes contain pseudogenes. The human genome has more than 15,000 recognizable pseudogenes.1 This is not what you would expect from an intelligent designer so the ID crowd tries to rationalize the existence of pseudogenes by proposing that they have an unknown function.

Tuesday, August 11, 2015

Four things that Francis Collins learned from sequencing the human genome

I've been doing a bit of research on the human genome in preparation for a book. This led me to an article published in 2003 by Francis Collins, former head of the Human Genome Consortium (Collins, 2003). It's mostly about how he deals with science and religion but there was an interesting description of what he learned from completing the human genome sequence.

Here's what he said ....
We discovered some pretty surprising things in reading out the human genome sequence. Here are four highlights.

1. Humans have fewer genes than expected. My definition of a gene here—because different people use different terminology—is a stretch of DNA that codes for a particular protein. There are probably stretches of DNA that code for RNAs that do not go on to make proteins. That understanding is only now beginning to emerge and may be fairly complicated. But the standard definition of “a segment of DNA that codes for a protein” gives one a surprisingly small number of about 30,000 for the number of human genes. Considering that we’ve been talking about 100,000 genes for the last fifteen years (that’s what most of the textbooks still say), this was a bit of a shock. In fact, some people took it quite personally. I think they were particularly distressed because the gene count for some other simpler organisms had been previously determined. After all, a roundworm has 19,000 genes, and mustard weed has 25,000 genes, and we only have 30,000? Does that seem fair? Even worse, when they decoded the genome of the rice, it looks as if rice has about 55,000 genes. So you need to have more respect for dinner tonight! What does that mean? Surely, an alien coming from outer space looking at a human being and looking at a rice plant would say the human being is biologically more complex. I don’t think there’s much doubt about that. So gene count must not be the whole story. So what is going on?

2. Human genes make more proteins than those of other critters. One of the things going on is that we begin to realize that one gene does not just make one protein in humans and other mammals. On the average, it makes about three, using the phenomenon of alternative splicing to create proteins with different architectures. One is beginning to recover some sense of pride here in our genome, which was briefly under attack, because now we can say, “Well, we don’t have very many genes but boy are they clever genes. Look what they can do!”

3. The male mutation rate is twice that of females. We also discovered that simply by looking at the Y chromosome and comparing it to the rest of the genome—of course, the Y chromosome only passes from fathers to sons, so it only travels through males—you can get a fix on the mutation rate in males compared to females. This was not particularly good news for the boys in this project because it seems that we make mistakes about twice as often as the women do in passing our DNA to the next generation. That means, guys, we have to take responsibility for the majority of genetic disease. It has to start somewhere; the majority of the time, it starts in us. If you are feeling depressed about that, let me also point out we can take credit for the majority of evolutionary progress, which after all is the same phenomenon.

4. “Junk” DNA may not be junk after all. I have been troubled for a long time about the way in which we dismissed about 95% of the genome as being junk because we didn’t know what its function was. We did not think it had one because we had not discovered one yet. I found it quite gratifying to discover that when you have the whole genome in front of you, it is pretty clear that a lot of the stuff we call “junk” has the fingerprints of being a DNA sequence that is actually doing something, at least, judging by the way evolution has treated it. So I think we should probably remove the term “junk” from the genome. At least most of it looks like it may very well have some kind of function.

Monday, August 10, 2015

Insulators, junk DNA, and more hype and misconceptions

The folks at Evolution News & Views (sic) can serve a very useful purpose. They are constantly scanning the scientific literature for any hint of evidence to support their claim about junk DNA. Recall that Intelligent Design Creationists have declared that if most of our genome is junk then intelligent design is falsified since one of the main predictions of intelligent design is that most of our genome will be functional.

THEME

Genomes & Junk DNA
They must be getting worried because their most recent posts sound quite desperate. The last one is: The Un-Junk Industry. It quotes a popular press report on a paper published recently in Proceedings of the National Academy of Sciences (USA). The creationists concede that the paper itself doesn't even mention junk DNA but the article in EurekAlert does.

Friday, August 07, 2015

How to write about RNA

I find it very frustrating to read reports about RNA these days because the writers almost always misrepresent the history of the field and exaggerate the significance of recent discoveries. An article in the July 23, 2015 issue of Nature illustrates the problem. The article is written by Elie Dolgin (@ElieDolgin), a freelance science journalist based in Massachusetts (USA). He graduated from McGill University (Montreal, Quebec, Canada) with a degree in biology and obtained a Ph.D. in genetics and evolution from the University of Edinburgh (Edinburgh, Scotland, UK).

Thursday, July 30, 2015

The next step in genomics

The draft sequence of the human genome was published in 2001. The "finished" version was published a few years later but annotation continues.

A massive amount of data on complex genomes has been published, especially on the human genome. The next step is to decide what this data means. Here are the most important questions from my perspective.

Monday, July 27, 2015

More confusion about the central dogma of molecular biology

I was doing some reading on lncRNAs (long non-coding RNAs) in order to find out how many of them had been assigned real biological functions. My reading was prompted by one of the latest updates to the human genome sequence; namely, assembly GRCh38.p3 from June 2015. The Ensembl website lists 14,889 lncRNA genes but I'm sure that most of these are just speculative [Ensembl Whole Genome].

The latest review by my colleagues here in the biochemistry department at the University of Toronto (Toronto, Canada), concludes that only a small fraction of these putative lncRNAs have a function (Palazzo and Lee, 2015). They point out that in the absence of evidence for function, the null hypothesis is that these RNAs are junk and the genes don't exist. That's not the view that annotators at Ensembl take.

I stumbled across a paper by Ling et al. (2015) that tries to make a case for function. I don't think their case is convincing but that's not what I want to discuss. I want to discuss their view of the Central Dogma of Molecular Biology. Here's the abstract ...
The central dogma of molecular biology states that the flow of genetic information moves from DNA to RNA to protein. However, in the last decade this dogma has been challenged by new findings on non-coding RNAs (ncRNAs) such as microRNAs (miRNAs). More recently, long non-coding RNAs (lncRNAs) have attracted much attention due to their large number and biological significance. Many lncRNAs have been identified as mapping to regulatory elements including gene promoters and enhancers, ultraconserved regions and intergenic regions of protein-coding genes. Yet, the biological function and molecular mechanisms of lncRNA in human diseases in general and cancer in particular remain largely unknown. Data from the literature suggest that lncRNA, often via interaction with proteins, functions in specific genomic loci or use their own transcription loci for regulatory activity. In this review, we summarize recent findings supporting the importance of DNA loci in lncRNA function and the underlying molecular mechanisms via cis or trans regulation, and discuss their implications in cancer. In addition, we use the 8q24 genomic locus, a region containing interactive SNPs, DNA regulatory elements and lncRNAs, as an example to illustrate how single nucleotide polymorphism (SNP) located within lncRNAs may be functionally associated with the individual’s susceptibility to cancer.
This is getting to be a familiar refrain. I understand how modern scientists might be confused about the difference between the Watson and the Crick versions of the Central Dogma [see The Central Dogma of Molecular Biology]. Many textbooks perpetuate the myth that Crick's sequence hypothesis is actually the Central Dogma. That's bad enough but lots of researchers seem to think that their false view of the Central Dogma goes even further. They think it means that the ONLY kind of genes in your genome are those that produce mRNA and protein.

I don't understand how such a ridiculous notion could arise but it must be a common misconception, otherwise why would these authors think that non-coding RNAs are a challenge to the Central Dogma? And why would the reviewers and editors think this was okay?

I'm pretty sure that I've interpreted their meaning correctly. Here's the opening sentences of the introduction to their paper ...
The Encyclopedia of DNA Elements (ENCODE) project has revealed that at least 75% of the human genome is transcribed into RNAs, while protein-coding genes comprise only 3% of the human genome. Because of a long-held protein-centered bias, many of the genomic regions that are transcribed into non-coding RNAs (ncRNAs) had been viewed as ‘junk’ in the genome, and the associated transcription had been regarded as transcriptional ‘noise’ lacking biological meaning.
They think that the Central Dogma is a "protein-centered bias." They think the Central Dogma rules out genes that specify noncoding RNAs. (Like tRNA and ribosomal RNA?)

Later on they say ....
The protein-centered dogma had viewed genomic regions not coding for proteins as ‘junk’ DNA. We now understand that many lncRNAs are transcribed from ‘junk’ regions, and even those encompassing transposons, pseudogenes and simple repeats represent important functional regulators with biological relevance.
It's simply not true that scientists in the past viewed all noncoding DNA as junk, at least not knowledgeable scientists [What's in Your Genome?]. Furthermore, no knowledgeable scientists ever interpreted the Central Dogma of Molecular Biology to mean that the only functional genes in a genome were those that encoded proteins.

Apparently Ling, Vincent, Pichler, Fodde, Berindan-Neagoe, Slack, and Calin knew scientists who DID believe such nonsense. Maybe they even believed it themselves.

Judging by the frequency with which such statements appear in the scientific literature, I can only assume that this belief is widespread among biochemists and molecular biologists. How in the world did this happen? How many Sandwalk readers were taught that the Central Dogma rules out all genes for noncoding RNAs? Did you have such a protein-centered bias about the role of genes? Who were your teachers?

Didn't anyone teach you who won the Nobel Prize in 1989? Didn't you learn about snRNAs? What did you think RNA polymerases I and III were doing in the cell?


Ling, H., Vincent, K., Pichler, M., Fodde, R., Berindan-Neagoe, I., Slack, F.J., and Calin, G.A. (2015) Junk DNA and the long non-coding RNA twist in cancer genetics. Oncogene (published online January 26, 2015) [PDF]

Palazzo, A.F. and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in Genetics 6:2 (published online January 26, 2015) [Abstract]

Friday, July 24, 2015

John Parrington talks about The Deeper Genome

Here's a video from Oxford Press where you can hear John Parrington describe some of the ideas in his book: The Deeper Genome: Why there is more to the human genome than meets the eye.



John Parrington discusses genome sequence conservation

John Parrington has written a book called, The Deeper Genome: Why there is more to the human genome than meets the eye. He claims that most of our genome is functional, not junk. I'm looking at how his arguments compare with Five Things You Should Know if You Want to Participate in the Junk DNA Debate.

There's one post for each of the five issues that informed scientists need to address if they are going to write about the amount of junk in your genome. This is the last one.

1. Genetic load
John Parrington and the genetic load argument
2. C-Value paradox
John Parrington and the c-value paradox
3. Modern evolutionary theory
John Parrington and modern evolutionary theory
4. Pseudogenes and broken genes are junk
John Parrington discusses pseudogenes and broken genes
5. Most of the genome is not conserved (this post)
John Parrington discusses genome sequence conservation

5. Most of the genome is not conserved

There are several places in the book where Parrington addresses the issue of sequence conservation. The most detailed discussion is on pages 92-95 where he discusses the criticisms leveled by Dan Graur against ENCODE workers. Parrington notes that 9% of the human genome is conserved and recognizes that conservation is a strong argument for function. By the same logic, it implies that the >90% of our genome that is not conserved is junk.

Here's how Parrington dismisses this argument ...
John Mattick and Marcel Dinger ... wrote an article for the HUGO Journal, official journal of the Human Genome Organisation, entitled "The extent of functionality in the human genome." ... In response to the accusation that the apparent lack of sequence conservation of 90 per cent of the genome means that it has no function, Mattick and Dinger argued that regulatory elements and noncoding RNAs are much more relaxed in their link between structure and function, and therefore much harder to detect by standard measures of function. This could mean that 'conservation is relative', depending on the type of genomic structure being analyzed.
In other words, a large part of our genome (~70%?) could be producing functional regulatory RNAs whose sequence is irrelevant to their biological function. Parrington then writes a full page on Mattick's idea that the genome is full of genes for regulatory RNAs.

The idea that 90% of our genome is not conserved deserves far more serious treatment. In the next chapter (Chapter 7), Parrington discusses the role of RNA in forming a "scaffold" to organize DNA in three dimensions. He notes that ...
That such RNAs, by virtue of their sequence but also their 3D shape, can bind DNA, RNA, and proteins, makes them ideal candidates for such a role.
But if the genes for these RNAs make up a significant part of the genome then that means that some of their sequences are important for function. That has genetic load implications and also implications about conservation.

If it's not a "significant" fraction of the genome then Parrington should make that clear to his readers. He knows that 90% of our genome is not conserved, even between individuals (page 142), and he should know that this is consistent with genetic load arguments. However, almost all of his main arguments against junk DNA require that the extra DNA have a sequence-specific function. Those facts are not compatible. Here's how he justifies his position ...
Those proposing a higher figure [for functional DNA] believe that conservation is an imperfect measure of function for a number of reasons. One is that since many non-coding RNAs act as 3D structures, and because regulatory DNA elements are quite flexible in their sequence constraints, their easy detection by sequence conservation methods will be much more difficult than for protein-coding regions. Using such criteria, John Mattick and colleagues have come up with much higher figures for the amount of functionality in the genome. In addition, many epigenetic mechanisms that may be central for genome function will not be detectable through a DNA sequence comparison since they are mediated by chemical modifications of the DNA and its associated proteins that do not involve changes in DNA sequence. Finally, if genomes operate as 3D entities, then this may not be easily detectable in terms of sequence conservation.
This book would have been much better if Parrington had put some numbers behind his speculations. How much of the genome is responsible for making functional non-coding RNAs and how much of that should be conserved in one way or another? How much of the genome is devoted to regulatory sequences and what kind of sequence conservation is required for functionality? How much of the genome is required for "epigenetic mechanisms" and how do they work if the DNA sequence is irrelevant?

You can't argue this way. More than 90% of our genome is not conserved—not even between individuals. If a good bit of that DNA is, nevertheless, functional, then those functions must not have anything to do with the sequence of the genome at those specific sites. Thus, regions that specify non-coding RNAs, for example, must perform their function even though all the base pairs can be mutated. The same goes for regulatory sequences—the actual sequence of these regulatory elements isn't conserved according to John Parrington. This requires a bit more explanation since it flies in the face of what we know about function and regulation.

Finally, if you are going to use bulk DNA arguments to get around the conflict then tell us how much of the genome you are attributing to formation of "3D entities." Is it 90%? 70%? 50%?


John Parrington discusses pseudogenes and broken genes

We are discussing Five Things You Should Know if You Want to Participate in the Junk DNA Debate and how they are described in John Parrington's book The Deeper Genome: Why there is more to the human genome than meets the eye. This is the fourth of five posts.

1. Genetic load
John Parrington and the genetic load argument
2. C-Value paradox
John Parrington and the c-value paradox
3. Modern evolutionary theory
John Parrington and modern evolutionary theory
4. Pseudogenes and broken genes are junk (this post)
John Parrington discusses pseudogenes and broken genes
5. Most of the genome is not conserved
John Parrington discusses genome sequence conservation

4. Pseudogenes and broken genes are junk

Parrington discusses pseudogenes at several places in the book. For example, he mentions on page 72 that both Richard Dawkins and Ken Miller have used the existence of pseudogenes as an argument against intelligent design. But, as usual, he immediately alerts his readers to other possible explanations ...
However, using the uselessness of so much of the genome for such a purpose is also risky, for what if the so-called junk DNA turns out to have an important function, but one that hasn't yet been identified?
This is a really silly argument. We know what genes look like and we know what broken genes look like. There are about 20,000 former protein-coding pseudogenes in the human genome. Some of them arose recently following a gene duplication or insertion of a cDNA copy. Some of them are ancient and similar pseudogenes are found at the same locations in other species. They accumulate mutations at a rate consistent with neutral theory and random genetic drift. (This is a demonstrated fact.)

It's ridiculous to suggest that a significant proportion of those pseudogenes might have an unknown important function. That doesn't rule out a few exceptions but, as a general rule, if it looks like a broken gene and acts like a broken gene, then chances are pretty high that it's a broken gene.

As usual, Parrington doesn't address the big picture. Instead he resorts to the standard ploy of junk DNA proponents by emphasizing the exceptions. He devotes more than two full pages (pages 143-144) to evidence that some pseudogenes have acquired a secondary function.
The potential pitfalls of writing off elements in the genome as useless or parasitical has been demonstrated by a recent consideration of the role of pseudogenes. ... recent studies are forcing a reappraisal of the functional role of these 'duds'.
Do you think his readers understand that even if every single broken gene acquired a new function that would still only account for less than 2% of the genome?
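The arithmetic behind that "less than 2%" figure is easy to check. Here's a minimal sketch in Python, assuming a rough average pseudogene length of 2.5 kb (an illustrative assumption on my part, not a number from the book):

```python
# Rough share of the human genome occupied by former protein-coding pseudogenes.
# The average pseudogene length below is an illustrative assumption.
NUM_PSEUDOGENES = 20_000        # former protein-coding pseudogenes (approximate)
AVG_PSEUDOGENE_LENGTH = 2_500   # bp, assumed rough average
GENOME_SIZE = 3.2e9             # haploid human genome, bp (approximate)

fraction = NUM_PSEUDOGENES * AVG_PSEUDOGENE_LENGTH / GENOME_SIZE
print(f"pseudogenes cover roughly {fraction:.1%} of the genome")
```

Even doubling the assumed average length keeps the total well under a twentieth of the genome, so rescuing every pseudogene wouldn't rescue much junk.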

There's a whole chapter dedicated to "The Jumping Genes" (Chapter 8). Parrington notes that 45% of our genome is composed of transposons (page 119). What are they doing in our genome? They could just be parasites (selfish DNA), which he equates with junk. However, Parrington prefers the idea that they serve as sources of new regulatory elements and that they are important in controlling responses to environmental pressures. They are also important in evolution.

As usual, there's no discussion about what fraction of the genome is functional in this way but the reader is left with the impression that most of that 45% may not be junk or parasites.

Most Sandwalk readers know that almost all of the transposon-related sequences are bits and pieces of transposons that haven't been active for millions of years. They are pseudogenes. They look like broken transposons, they act like broken transposons, and they evolve like broken transposons. It's safe to assume that's what they are. This is junk DNA and it makes up almost half of our genome.

John Parrington never mentions this nasty little fact. He leaves his readers with the impression that 45% of our genome consists of active transposons jumping around in our genome. I assume that this is what he believes to be true. He has not read the literature.

Chapter 9 is about epigenetics. (You knew it was coming, didn't you?) Apparently, epigenetic changes can make the genome more amenable to transposition. This opens up possible functional roles for transposons.
As we've seen, stress may enhance transposition and, intriguingly, this seems to be linked to changes in the chromatin state of the genome, which permits repressed transposons to become active. It would be very interesting if such a mechanism constituted a way for the environment to make a lasting, genetic mark. This would be in line with recent suggestions that an important mechanism of evolution is 'genome resetting'—the periodic reorganization of the genome by newly mobile DNA elements, which establishes new genetic programs in embryo development. New evidence suggests that such a mechanism may be a key route whereby new species arise, and may have played an important role in the evolution of humans from apes. This is very different from the traditional view of evolution being driven by the gradual accumulation of mutations.
It was at this point, on page 139, that I realized I was dealing with a scientist who was in way over his head.

Parrington returns to this argument several times in his book. For example, in Chapter 10 ("Code, Non-code, Garbage, and Junk") he says ...
These sequences [transposons] are assumed to be useless, and therefore their rate of mutation is taken to represent a 'neutral' reference; however, as John Mattick and his colleague Marcel Dinger of the Garvan Institute have pointed out, a flaw in such reasoning is 'the questionable proposition that transposable elements, which provide the major source of evolutionary plasticity and novelty, are largely non-functional.' In fact, as we saw in Chapter 8, there is increasing evidence that while transposons may start off as molecular parasites, they can also play a role in the creation of new regulatory elements, non-coding RNAs, and other such important functional components of the genome. It is this that has led John Stamatoyannopoulos to conclude that 'far from being an evolutionary dustbin, transposable elements appear to be active and lively members of the genomic regulatory community, deserving of the same level of scrutiny applied to other genic or regulatory features.' In fact, the emerging role for transposition in creating new regulatory mechanisms in the genome challenges the very idea that we can divide the genome into 'useful' and 'junk' components.
Keep in mind that active transposons represent only a tiny percentage of the human genome. About 50% of the genome consists of transposon flotsam and jetsam—bits and pieces of broken transposons. It looks like junk to me.

Why do all opponents of junk DNA argue this way without putting their cards on the table? Why don't they give us numbers? How much of the genome consists of transposon sequences that have a biological function? Is it 50%, 20%, 5%?


John Parrington and modern evolutionary theory

We are continuing our discussion of John Parrington's book The Deeper Genome: Why there is more to the human genome than meets the eye. This is the third of five posts on: Five Things You Should Know if You Want to Participate in the Junk DNA Debate

1. Genetic load
John Parrington and the genetic load argument
2. C-Value paradox
John Parrington and the c-value paradox
3. Modern evolutionary theory (this post)
John Parrington and modern evolutionary theory
4. Pseudogenes and broken genes are junk
John Parrington discusses pseudogenes and broken genes
5. Most of the genome is not conserved
John Parrington discusses genome sequence conservation

3. Modern evolutionary theory

You can't understand the junk DNA debate unless you've read Michael Lynch's book The Origins of Genome Architecture. That means you have to understand modern population genetics and the role of random genetic drift in the evolution of genomes. There's no evidence in Parrington's book that he has read The Origins of Genome Architecture and no evidence that he understands modern evolutionary theory. The only evolution he talks about is natural selection (Chapter 1).

Here's an example where he demonstrates adaptationist thinking and the fact that he hasn't read Lynch's book ...
At first glance, the existence of junk DNA seems to pose another problem for Crick's central dogma. If information flows in a one-way direction from DNA to RNA to protein, then there would appear to be no function for such noncoding DNA. But if 'junk DNA' really is useless, then isn't it incredibly wasteful to carry it around in our genome? After all, the reproduction of the genome that takes place during each cell division uses valuable cellular energy. And there is also the issue of packaging the approximately 3 billion base pairs of the human genome into the tiny cell nucleus. So surely natural selection would favor a situation where both genomic energy requirements and packaging needs are reduced fiftyfold?1
Nobody who understands modern evolutionary theory would ask such a question. They would have read all the published work on the issue and they would know about the limits of natural selection and why species can't necessarily get rid of junk DNA even if it seems harmful.

People like that would also understand the central dogma of molecular biology.


1. He goes on to propose a solution to this adaptationist paradox. Apparently, most of our genome consists of parasites (transposons), an idea he mistakenly attributes to Richard Dawkins' concept of The Selfish Gene. Parrington seems to have forgotten that most of the sequence of active transposons consists of protein-coding genes so it doesn't work very well as an explanation for excess noncoding DNA.

John Parrington and the C-value paradox

We are discussing John Parrington's book The Deeper Genome: Why there is more to the human genome than meets the eye. This is the second of five posts on: Five Things You Should Know if You Want to Participate in the Junk DNA Debate

1. Genetic load
John Parrington and the genetic load argument
2. C-Value paradox (this post)
John Parrington and the c-value paradox
3. Modern evolutionary theory
John Parrington and modern evolutionary theory
4. Pseudogenes and broken genes are junk
John Parrington discusses pseudogenes and broken genes
5. Most of the genome is not conserved
John Parrington discusses genome sequence conservation


2. C-Value paradox

Parrington addresses this issue on page 63 by describing experiments from the late 1960s showing that there was a great deal of noncoding DNA in our genome and that only a few percent of the genome was devoted to encoding proteins. He also notes that the differences in genome sizes of similar species gave rise to the possibility that most of our genome was junk. Five pages later (page 69) he reports that scientists were surprised to find only 30,000 protein-coding genes when the sequence of the human genome was published—"... the other big surprise was how little of our genomes are devoted to protein-coding sequence."

Contradictory stuff like that makes it very hard to follow his argument. On the one hand, he recognizes that scientists have known for 50 years that only 2% of our genome encodes proteins but, on the other hand, they were "surprised" to find this confirmed when the human genome sequence was published.

He spends a great deal of Chapter 4 explaining the existence of introns and claims that "over 90 per cent of our genes are alternatively spliced" (page 66). This seems to be offered as an explanation for all the excess noncoding DNA but he isn't explicit.

In spite of the fact that genome comparisons are a very important part of this debate, Parrington doesn't return to this point until Chapter 10 ("Code, Non-code, Garbage, and Junk").

We know that the C-Value Paradox isn't really a paradox because most of the excess DNA in various genomes is junk. There isn't any other explanation that makes sense of the data. I don't think Parrington appreciates the significance of this explanation.

The examples quoted in Chapter 10 are the lungfish, with a huge genome, and the pufferfish (Fugu), with a genome much smaller than ours. This requires an explanation if you are going to argue that most of the human genome is functional. Here's Parrington's explanation ...
Yet, despite having a genome only one eighth the size of ours, Fugu possesses a similar number of genes. This disparity raises questions about the wisdom of assigning functionality to the vast majority of the human genome, since, by the same token, this could imply that lungfish are far more complex than us from a genomic perspective, while the smaller amount of non-protein-coding DNA in the Fugu genome suggests the loss of such DNA is perfectly compatible with life in a multicellular organism.

Not everyone is convinced about the value of these examples though, John Mattick, for instance, believes that organisms with a much greater amount of DNA than humans can be dismissed as exceptions because they are 'polyploid', that is, their cells have far more than the normal two copies of each gene, or their genomes contain an unusually high proportion of inactive transposons.
In other words, organisms with larger genomes seem to be perfectly happy carrying around a lot of junk DNA! What kind of an argument is that?
Mattick is also not convinced that Fugu provides a good example of a complex organism with no non-coding DNA. Instead, he points out that 89% of this pufferfish's DNA is still non-protein-coding, so the often-made claim that this is an example of a multicellular organism without such DNA is misleading.
[Mattick has been] a true visionary in his field; he has demonstrated an extraordinary degree of perseverance and ingenuity in gradually proving his hypothesis over the course of 18 years.

Hugo Award Committee
Seriously? That's the best argument he has? He and Mattick misrepresent what scientists say about the pufferfish genome—nobody claims that the entire genome encodes proteins—then they ignore the main point; namely, why do humans need so much more DNA? Is it because we are polyploid?
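It helps to put the commonly cited genome sizes side by side. A quick sketch (the sizes are approximate round figures that vary a bit by assembly, not exact values):

```python
# Approximate haploid genome sizes (base pairs) for the species discussed.
# These are commonly cited round figures, not exact assembly sizes.
genome_sizes = {
    "pufferfish (Fugu)": 0.4e9,
    "human": 3.2e9,
    "onion": 16e9,
    "marbled lungfish": 130e9,
}

human = genome_sizes["human"]
for species, size in sorted(genome_sizes.items(), key=lambda kv: kv[1]):
    print(f"{species:18s} {size / 1e9:6.1f} Gb  ({size / human:5.2f}x human)")
```

The pufferfish manages with an eighth of our DNA and a similar gene count, while lungfish and onions carry many times more DNA than we do. Similar gene numbers spread across a more than 300-fold range of genome sizes is exactly what the junk DNA explanation predicts.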

It's safe to say that John Parrington doesn't understand the C-value argument. We already know that Mattick doesn't understand it and neither does Jonathan Wells, who also wrote a book on junk DNA [John Mattick vs. Jonathan Wells]. I suppose John Parrington prefers to quote Mattick instead of Jonathan Wells—even though they use the same arguments—because Mattick has received an award from the Human Genome Organization (HUGO) for his ideas and Wells hasn't [John Mattick Wins Chen Award for Distinguished Academic Achievement in Human Genetic and Genomic Research].

For further proof that Parrington has not done his homework, I note that the Onion Test [The Case for Junk DNA: The onion test] isn't mentioned anywhere in his book. When people dismiss or ignore the Onion Test, it usually means they don't understand it. (For a spectacular example of such misunderstanding, see: Why the "Onion Test" Fails as an Argument for "Junk DNA").


Five things John Parrington should discuss if he wants to participate in the junk DNA debate

It's frustrating to see active scientists who think that most of our genome could have a biological function but who seem to be completely unaware of the evidence for junk. Most of the positive evidence for junk is decades old so there's no excuse for such ignorance.

I wrote a post in 2013 to help these scientists understand the issues: Five Things You Should Know if You Want to Participate in the Junk DNA Debate. It was based on a talk I gave at the Evolutionary Biology meeting in Chicago that year.1 Let's look at John Parrington's new book to see if he got the message [Hint: he didn't].

There's one post for each of the five issues that informed scientists need to address if they are going to write about the amount of junk in your genome.

1. Genetic load
John Parrington and the genetic load argument
2. C-Value paradox
John Parrington and the c-value paradox
3. Modern evolutionary theory
John Parrington and modern evolutionary theory
4. Pseudogenes and broken genes are junk
John Parrington discusses pseudogenes and broken genes
5. Most of the genome is not conserved
John Parrington discusses genome sequence conservation


1. It hasn't seemed to help very much.

John Parrington and the genetic load argument

We are discussing John Parrington's book The Deeper Genome: Why there is more to the human genome than meets the eye. This is the first of five posts on: Five Things You Should Know if You Want to Participate in the Junk DNA Debate

1. Genetic load (this post)
John Parrington and the genetic load argument
2. C-Value paradox
John Parrington and the c-value paradox
3. Modern evolutionary theory
John Parrington and modern evolutionary theory
4. Pseudogenes and broken genes are junk
John Parrington discusses pseudogenes and broken genes
5. Most of the genome is not conserved
John Parrington discusses genome sequence conservation


1. Genetic load

The genetic load argument has been around for 50 years. It's why experts did not expect a huge number of genes when the genome sequence was published. It's why the sequence of most of our genome must be irrelevant from an evolutionary perspective.

This argument does not rule out bulk DNA hypotheses but it does rule out all those functions that require specific sequences in order to confer biological function. This includes the speculation that most transcripts have a function and it includes the speculation that there's a vast amount of regulatory sequence in our genome. Chapter 5 of The Deeper Genome is all about the importance of regulatory RNAs.
So, starting from a failed attempt to turn a petunia purple, the discovery of RNA interference has revealed a whole new network of gene regulation mediated by RNAs and operating in parallel to the more established one of protein regulatory factors. ... Studies have revealed that a surprising 60 per cent of miRNAs turn out to be recycled introns, with the remainder being generated from the regions between genes. Yet these were parts of the genome formerly viewed as junk. Does this mean we need a reconsideration of this question? This is an issue we will discuss in Chapter 6, in particular with regard to the ENCODE project ...
The implication here is that a substantial part of the genome is devoted to the production of regulatory RNAs. Presumably, the sequences of those RNAs are important. But this conflicts with the genetic load argument unless we're only talking about an insignificant fraction of the genome.

But that's only one part of Parrington's argument against junk DNA. Here's the summary from the last Chapter ("Conclusion") ...
As we've discussed in this book, a major part of the debate about the ENCODE findings has focused on the question of what proportion of the genome is functional. Given that the two sides of this debate use quite different criteria to assess functionality it is likely that it will be some time before we have a clearer idea about who is the most correct in this debate. Yet, in framing the debate in this quantitative way, there is a danger that we might lose sight of an exciting qualitative shift that has been taking place in biology over the past decade or so. So a previous emphasis on a linear flow of information, from DNA to RNA to protein through a genetic code, is now giving way to a much more complex picture in which multiple codes are superimposed on one another. Such a viewpoint sees the gene as more than just a protein-coding unit; instead it can equally be seen as an accumulation of chemical modifications in the DNA or its associated histones, a site for non-coding RNA synthesis, or a nexus in a 3D network. Moreover, since we now know that multiple sites in the genome outside the protein-coding regions can produce RNAs, and that even many pseudo-genes are turning out to be functional, the very question of what constitutes a gene is now being challenged. Or, as Ed Weiss at the University of Pennsylvania recently put it, 'the concept of a gene is shredding.' Such is the nature of the shift that now we face the challenge of not just recognizing the true scale of this complexity, but explaining how it all comes together to make a living, functioning, human being.
I've already addressed some of the fuzzy thinking in this paragraph [The fuzzy thinking of John Parrington: The Central Dogma and The fuzzy thinking of John Parrington: pervasive transcription]. The point I want to make here is that Parrington's arguments for function in the genome require a great deal of sequence information. They all conflict with the genetic load argument.

Parrington doesn't cover the genetic load argument at all in his book. I don't know why since it seems very relevant. We could not survive as a species if the sequence of most of our genome was important for biological function.
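The genetic load argument is easy to illustrate with a back-of-the-envelope calculation. A sketch, where the mutation count per generation and the fraction of hits that are harmful are illustrative assumptions rather than measured values:

```python
# Back-of-the-envelope genetic load: expected new deleterious mutations per
# offspring as a function of the genome's functional fraction.
# Both input numbers below are illustrative assumptions.
NEW_MUTATIONS = 100          # new mutations per offspring (estimates run ~60-150)
DELETERIOUS_FRACTION = 0.1   # assumed fraction of hits in functional DNA that are harmful

def new_deleterious(functional_fraction):
    """Expected new deleterious mutations per offspring, assuming mutations
    land uniformly across the genome."""
    return NEW_MUTATIONS * functional_fraction * DELETERIOUS_FRACTION

for f in (0.02, 0.10, 0.90):
    print(f"functional fraction {f:.0%}: ~{new_deleterious(f):.1f} new deleterious mutations/offspring")
```

If a population can only tolerate on the order of one new deleterious mutation per offspring, a genome that is 90% functional is hard to reconcile with our continued existence, while a 2% functional genome poses no problem. That is the genetic load argument in one loop.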


Sunday, July 19, 2015

The fuzzy thinking of John Parrington: pervasive transcription

Opponents of junk DNA usually emphasize the point that they were surprised when the draft human genome sequence was published in 2001. They expected about 100,000 genes but the initial results suggested fewer than 30,000 (the final number is about 25,000).1 The reason they were surprised was that they had not kept up with the literature on the subject and they had not been paying attention when the sequence of chromosome 22 was published in 1999 [see Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome].

The experts were expecting about 30,000 genes and that's what the genome sequence showed. Normally this wouldn't be such a big deal. Those who were expecting a large number of genes would just admit that they were wrong and they hadn't kept up with the literature over the past 30 years. They should have realized that discoveries in other species and advances in developmental biology had reinforced the idea that mammals only needed about the same number of genes as other multicellular organisms. Most of the differences are due to regulation. There was no good reason to expect that humans would need a huge number of extra genes.

That's not what happened. Instead, opponents of junk DNA insist that the complexity of the human genome cannot be explained by such a low number of genes. There must be some other explanation to account for the missing genes. This sets the stage for at least seven different hypotheses that might resolve The Deflated Ego Problem. One of them is the idea that the human genome contains thousands and thousands of nonconserved genes for various regulatory RNAs. These are the missing genes and they account for a lot of the "dark matter" of the genome—sequences that were thought to be junk.

Here's how John Parrington describes it on page 91 of his book.
The study [ENCODE] also found that 80 per cent of the genome was generating RNA transcripts and, suggesting their importance, many were found only in specific cellular compartments, indicating that they have fixed addresses where they operate. Surely there could hardly be a greater divergence from Crick's central dogma than this demonstration that RNAs were produced in far greater numbers across the genome than could be expected if they were simply intermediates between DNA and protein. Indeed, some ENCODE researchers argued that the basic unit of transcription should now be considered as the transcript. So Stamatoyannopoulos claimed that 'the project has played an important role in changing our concept of the gene.'
This passage illustrates my difficulty in coming to grips with Parrington's logic in The Deeper Genome. Just about every page contains statements that are either wrong or misleading, and when he strings them together they lead to a fundamentally flawed conclusion. In order to critique the main point, you have to correct each of the so-called "facts" that he gets wrong. This is very tedious.

I've already explained why Parrington is wrong about the Central Dogma of Molecular Biology [John Avise doesn't understand the Central Dogma of Molecular Biology]. His readers don't know that he's wrong so they think that the discovery of noncoding RNAs is a revolution in our understanding of biochemistry—a revolution led by the likes of John A. Stamatoyannopoulos in 2012.

The reference in the book to the statement by Stamatoyannopoulos is from the infamous Elizabeth Pennisi article on ENCODE Project Writes Eulogy for Junk DNA (Pennisi, 2012). Here's what she said in that article ...
As a result of ENCODE, Gingeras and others argue that the fundamental unit of the genome and the basic unit of heredity should be the transcript—the piece of RNA decoded from DNA—and not the gene. “The project has played an important role in changing our concept of the gene,” Stamatoyannopoulos says.
I'm not sure what concept of a gene these people had before 2012. It appears that John Parrington is under the impression that genes are units that encode proteins and maybe that's what Pennisi and Stamatoyannopoulos thought as well.

If so, then perhaps the publicity surrounding ENCODE really did change their concept of a gene, but all that proves is that they were remarkably uninformed before 2012. Intelligent biochemists have known for decades that the best definition of a gene is "a DNA sequence that is transcribed to produce a functional product."2 In other words, we have been defining a gene in terms of transcripts for 45 years [What Is a Gene?].

This is just another example of wrong and misleading statements that will confuse readers. If I were writing a book I would say, "The human genome sequence confirmed the predictions of the experts that there would be no more than 30,000 genes. There's nothing in the genome sequence or the ENCODE results that has any bearing on the correct understanding of the Central Dogma and there's nothing that changes the correct definition of a gene."

You can see where John Parrington's thinking is headed. Apparently, Parrington is one of those scientists who were completely unaware that genes could specify functional RNAs and completely unaware that Crick knew this back in 1970 when he tried to correct people like Parrington. Thus, Parrington and his colleagues were shocked to learn that the human genome has only 25,000 genes and that many of them don't encode proteins. Instead of realizing that his view was wrong, he thinks that the ENCODE results overthrew those old definitions and changed the way we think about genes. He tries to convince his readers that there was a revolution in 2012.

Parrington seems to be vaguely aware of the idea that most pervasive transcription is due to noise or junk RNA. However, he gives his readers no explanation of the reasoning behind such a claim. Spurious transcription is predicted because we understand the basic concept of transcription initiation. We know that promoter sequences and transcription factor binding sites are short sequences and we know that they HAVE to occur at high frequency in large genomes just by chance. This is not just speculation. [see The "duon" delusion and why transcription factors MUST bind non-functionally to exon sequences and How RNA Polymerase Binds to DNA]

If our understanding of transcription initiation is correct then all you need is an activator transcription factor binding site near something that's compatible with a promoter sequence. Any given cell type will contain a number of such factors and they must bind to a large number of nonfunctional sites in a large genome. Many of these will cause occasional transcription, giving rise to low abundance junk RNA. (Most of the ENCODE transcripts are present at less than one copy per cell.)
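The expectation of abundant spurious binding sites follows from simple probability. A minimal sketch, assuming equal base frequencies (real genomes deviate somewhat, but the conclusion doesn't change):

```python
# Expected number of chance matches to a short sequence motif in a large genome.
# Assumes equal base frequencies, so each position matches with probability 1/4.
GENOME_SIZE = 3.2e9   # haploid human genome, bp (approximate)
STRANDS = 2           # a motif can occur on either strand

def expected_sites(motif_length, genome_size=GENOME_SIZE):
    """Expected chance occurrences of one specific motif of the given length."""
    return (0.25 ** motif_length) * genome_size * STRANDS

# Typical transcription factors recognize only ~6-10 bp of specific sequence.
for n in (6, 8, 10):
    print(f"{n}-bp motif: ~{expected_sites(n):,.0f} chance matches")
```

Even a fairly specific 10-bp recognition sequence is expected to occur thousands of times by chance in a 3.2 Gb genome, and a 6-bp site over a million times, so nonfunctional binding, and occasional spurious initiation, is unavoidable.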

Different tissues will have different transcription factors. Thus, the low abundance junk RNAs must exhibit tissue specificity if our prediction is correct. Parrington and the ENCODE workers seem to think that the cell specificity of these low abundance transcripts is evidence of function. It isn't—it's exactly what you expect of spurious transcription. Parrington and the ENCODE leaders don't understand the scientific literature on transcription initiation and transcription factor binding sites.

It takes me an entire blog post to explain the flaws in just one paragraph of Parrington's book. The whole book is like this. The only thing it has going for it is that it's better than Nessa Carey's book [Nessa Carey doesn't understand junk DNA].


1. There are about 20,000 protein-encoding genes and an unknown number of genes specifying functional RNAs. I'm estimating that there are about 5,000 but some people think there are many more.

2. No definition is perfect. My point is that defining a gene as a DNA sequence that encodes a protein is something that should have been purged from textbooks decades ago. Any biochemist who ever thought seriously enough about the definition to bring it up in a scientific paper should be embarrassed to admit that they ever believed such a ridiculous definition.

Pennisi, E. (2012) "ENCODE Project Writes Eulogy for Junk DNA." Science 337:1159-1161. [doi:10.1126/science.337.6099.1159]