Thursday, July 28, 2016

False history and the number of genes: 2016

There's an article about junk DNA in the latest issue of New Scientist. The title is "You are junk: Why it’s not your genes that make you human." The author is Colin Barras, a science writer from Michigan with a Ph.D. in paleontology.

He begins with .....
IT WAS a discovery that threatened to overturn everything we thought about what makes us human. At the dawn of the new millennium, two rival teams were vying to be the first to sequence the human genome. Their findings, published in February 2001, made headlines around the world. Back-of-the-envelope calculations had suggested that to account for the sheer complexity of human biology, our genome should contain roughly 100,000 genes. The estimate was wildly off. Both groups put the actual figure at around 30,000. We now think it is even fewer – just 20,000 or so.

"It was a massive shock," says geneticist John Mattick. "That number is tiny. It’s effectively the same as a microscopic worm that has just 1000 cells."
There's more to the story but I'll leave that to another post. Right now I want to focus on the persistent, and false, meme about the "shocking discovery." Here are two previous posts on the subject.

False History and the Number of Genes 2010

Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome

It is simply not true that knowledgeable experts in the field were surprised by the number of genes in the draft sequence of the human genome published in 2001. Most of these experts were well aware of previously published work in biochemistry and molecular biology. They knew about the genetic load arguments dating back to the 1940s.

The best estimates of the number of genes in the human genome had long been incorporated into the textbooks. Benjamin Lewin, chief editor of Cell, was one of these experts. In his very popular textbook (Genes II) he concluded, in 1983, that there were 30,000 - 40,000 genes. Molecular Biology of the Cell was another popular textbook; its authors (Alberts et al. 1983) estimated that the human genome contained 30,000 genes.

The 2001 draft sequence estimated 30,000 genes. Who was shocked?

The experts predicted about 30,000 genes and that's exactly what was discovered. The most recent updates of the human genome reference sequence have about 25,000 genes, of which 20,000 are protein-coding genes. So the facts of the story are wrong if you go by what the knowledgeable experts were saying before the human genome sequence was published.

... those ignorant of history are not condemned to repeat it; they are merely destined to be confused.

Stephen Jay Gould
Ontogeny and Phylogeny (1977)
It's true that there were many non-experts who had not studied the evidence back in 2001. They may have fallen for back-of-the-envelope guesses by other non-experts. But if you are going to make a point about the state of knowledge you don't quote the non-experts.

So, the facts are wrong. The experts were not shocked by the number of genes in the human genome. If that's true (it is) then what are we to make of the opening sentence of the New Scientist article? ...
IT WAS a discovery that threatened to overturn everything we thought about what makes us human.
That's false as well. Several decades of work before 2001 had shown us that the differences between species were not due to differences in the number of genes but to differences in how and when they were regulated. That was the state of knowledge back then and it's still the state of knowledge today. Nothing about the human genome sequence "threatened" anything we thought about what makes us human. Developmental biologists had essentially solved the problem in the 1980s.

The real problem here is that some scientists (cough! ... John Mattick ... cough!) suffer from The Deflated Ego Problem. They believe that humans are much more complex than other species so they were expecting us to have lots more genes. They were shocked when they learned that humans have about the same number of genes as other animals.

Whenever you read an article that begins with this false meme you can be certain that it's going to describe some solution to the "problem." There are seven common rationales used to explain away the "shocking" discovery that we don't have many more genes than a fruit fly [Vertebrate Complexity Is Explained by the Evolution of Long-Range Interactions that Regulate Transcription?]. You know the article will use at least one of these arguments to cope with their deflated egos. Here's the list copied from a previous post.
1. Alternative Splicing: We may not have many more genes than a fruit fly but our genes can be rearranged in many different ways and this accounts for why we are much more complex. We have only 25,000 genes but through the magic of alternative splicing we can make 100,000 different proteins. That makes us almost ten times more complex than a fruit fly. (Assuming they don't do alternative splicing.)
2. Small RNAs: Scientists have miscalculated the number of genes by focusing only on protein encoding genes. Our genome actually contains tens of thousands of genes for small regulatory RNAs. These small RNA molecules combine in very complex ways to control the expression of the more traditional genes. This extra layer of complexity, not found in simple organisms, is what explains the Deflated Ego Problem.
3. Pseudogenes: The human genome contains thousands of apparently inactive genes called pseudogenes. Many of these genes are not extinct genes, as is commonly believed. Instead, they are genes-in-waiting. The complexity of humans is explained by invoking ways of tapping into this reserve to create new genes very quickly.
4. Transposons: The human genome is full of transposons but most scientists ignore them and don't count them in the number of genes. However, transposons are constantly jumping around in the genome and when they land next to a gene they can change it or cause it to be expressed differently. This vast pool of transposons makes our genome much more complicated than that of the simple species. This genome complexity is what's responsible for making humans more complex.
5. Regulatory Sequences: The human genome is huge compared to those of the simple species. All this extra DNA is due to increases in the number of regulatory sequences that control gene expression. We don't have many more protein-encoding regions but we have a much more complex system of regulating the expression of proteins. Thus, the fact that we are more complex than a fruit fly is not due to more genes but to more complex systems of regulation.
6. The Unspecified Anti-Junk Argument: We don't know exactly how to explain the Deflated Ego Problem but it must have something to do with so-called "junk" DNA. There's more and more evidence that junk DNA has a function. It's almost certain that there's something hidden in the extra-genic DNA that will explain our complexity. We'll find it eventually.
7. Post-translational Modification: Proteins can be extensively modified in various ways after they are synthesized. The modifications, such as phosphorylation, glycosylation, editing, etc., give rise to variants with different functions. In this way, the 25,000 primary protein products can actually be modified to make a set of enzymes with several hundred thousand different functions. That explains why we are so much more complicated than worms even though we have similar numbers of genes.


T Ryan Gregory said...

In the very next paragraph, he talks about how 30,000 was NOT a shock to evolutionary biologists because they had calculated a maximum of about that number based on limits imposed by mutation rate.

I know this, because I was interviewed for the story.

Unknown said...

I've been reading a lot about Drosophila larval salivary glands lately, and I came across a paper by Painter (1934; I think) where it's estimated that fruit flies might have something like 3000 genes, if one assumes a correlation between the number of bands visible on the polytene chromosomes and the number of loci. It's interesting, because it's such a low number, and because it was a viable hypothesis at that time! Bridges (1935) discusses a similar correlation in his landmark paper with the salivary chromosome maps. Using such historical data, 30000 genes is surprising -- but in the other direction!

Larry Moran said...

I discuss this in the next post.

Don't you think it's weird that he starts off saying how shocking it was to discover that humans had only 30,000 genes then, a few paragraphs later, he notes that the experts (evolutionary biologists) weren't shocked at all?

Then he goes on to imply that even the experts thought that all noncoding DNA (98%) was junk!

Larry Moran said...

Up until about 1980, it was widely believed that Drosophila melanogaster had about 5000 genes and about 5000 bands in polytene chromosomes.

By the end of the decade, it was clear there were more genes. When I wrote my genomes chapter in 1994, I estimated that fruit flies had 7,000-10,000 genes. It was a surprise to learn that they had MORE genes when the genome sequence was published! (The current number is 15,682.)

Most of my colleagues thought that humans wouldn't have a lot more genes than Drosophila, so they were happy with estimates that were substantially less than 50,000 human genes and close to the old estimates of about 30,000 genes.

John Harshman said...

Did you happen to mention dogs' asses during the interview?

SPARC said...

I guess it is worth citing what Walter Gilbert wrote back in 1992:

"The DNA sequence has a simple numerical expression: it is composed of three billion base pairs. That is enough information to code for about 100,000 to 300,000 genes, each gene being a region of DNA that can specify a protein or some other structure that carries out a function in the organism. Nobody knows how many genes are really involved, because we do not know the average size of a gene in the human body. Our estimate of 100,000 assumes that there are about 30,000 base pairs per gene, which is a reasonably good guess. But many genes are only 10,000 base pairs long, so perhaps there are as many as 300,000. Many of the most interesting of those genes have multiple RNA splicing patterns, that is, the messenger RNA transcribed from a single gene may splice together different parts of the DNA sequence of the gene. The function of these patterns must be understood in order to study an individual human gene. So saying that a human is made up of 1,000 genes underestimates the complexity of the human being, because many of the gene may encode ten or twenty different function in different tissues."

Walter Gilbert (1992): A Vision of the Grail
In: Daniel J. Kevles and Leroy E. Hood (1992):
The Code of Codes: Scientific and Social Issues in the Human Genome Project
Harvard University Press

Larry Moran said...

Gilbert is the man who is largely responsible for spreading the idea that humans have 100,000 genes. You can see from this quote that it was an evidence-free estimate based on a simple calculation. The estimate assumed that the entire genome was composed of genes even though we knew at the time (1992) that half of it was transposons and bits of transposons. We also knew in 1992 that there was a lot of unique-sequence DNA between genes. (Lots of genomic DNA had been sequenced.)

Gilbert was wrong about the number of genes. He was also wrong when he said that many genes may encode ten or twenty different functions. Today, 24 years later, we have no evidence to support such a ridiculous claim. In fact, I know of only one gene that might possibly make more than 10 different proteins.

Unknown said...

This is unrelated to the blog post, but I just read the following paper and thought I'd ask T Ryan and Laurence what their thoughts were:

Given that this article and others have found that synonymous codons aren't always functionally neutral, how does that affect the validity of some phylogenetic analyses and judging the evolution of genes based on synonymous/non-synonymous mutations?

Thanks in advance.

T Ryan Gregory said...

@John Harshman -- Yes, I did. And the author asked for some raw data to plot a figure showing actual genome size vs. non-coding DNA content, which I provided. I believe he had intended some much more substantial non-Mattick revisions, but it was vetoed by the editors, given the intended thrust of the article.

Michael Tress said...

A bit of a late reply, but I just came across this comment and I thought it was interesting. I can think of four human genes that are likely to have more than 10 different proteins and one that we know has nine. But our results certainly suggest that this is rare. The genes likely to have 10+ proteins are TPM1, PLEC, and the PCDHA and PCDHG families (both bizarrely classified as "gene clusters" rather than a single gene by HGNC). The UGT1A "gene cluster" produces nine different proteins.