It's difficult to explain fundamental concepts of biology to the average person. That's why I'm so interested in Siddhartha Mukherjee's book "The Gene: an intimate history." It's a #1 bestseller so he must be doing something right.
My working definition of a gene is based on a blog post from several years ago [What Is a Gene?].
A gene is a DNA sequence that is transcribed to produce a functional product.
This covers two types of genes: those that eventually produce proteins (polypeptides); and those that produce functional noncoding RNAs. This distinction is important when discussing what's in our genome.
My definition of a gene, which is shared by many scientists, includes introns. In the case of protein-coding genes it includes the parts of the gene specifying untranslated sequences at the ends of an mRNA molecule (5′-UTRs and 3′-UTRs). Thus, protein-coding genes make up 25-30% of our genome. Most of that fraction is noncoding and most of it is junk. Coding regions are only 1.25% of the human genome. It is misleading to say that genes make up only 2% of our genome.
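To put those percentages in absolute terms, here is a rough back-of-the-envelope sketch in Python. It assumes a genome size of 3.2 billion base pairs (the figure used later in this post), and the name `bases_for_percent` is purely illustrative, not taken from any library.

```python
# Rough arithmetic only: converts the percentages quoted above into base pairs.
# Assumption: a human genome size of about 3.2 billion bp.
GENOME_SIZE_BP = 3_200_000_000


def bases_for_percent(percent, genome_size=GENOME_SIZE_BP):
    """Return the approximate number of base pairs in a given percentage of the genome."""
    return round(genome_size * percent / 100)


coding = bases_for_percent(1.25)   # coding regions only
genes_low = bases_for_percent(25)  # genes including introns and UTRs, low estimate
genes_high = bases_for_percent(30) # genes including introns and UTRs, high estimate

print(f"Coding regions: ~{coding / 1e6:.0f} Mb")
print(f"Protein-coding genes: ~{genes_low / 1e6:.0f}-{genes_high / 1e6:.0f} Mb")
```

Under these assumptions, coding regions come to roughly 40 million base pairs, while protein-coding genes (introns and UTRs included) span roughly 800 to 960 million base pairs.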
My definition of a gene does not include the regulatory regions that control gene expression. These sequences play an essential role in the genome and they are largely responsible for the differences between closely related species. Mutations in those non-gene regions often cause genetic diseases in humans—a major topic in Mukherjee's book.
His book has a short glossary where you can find the following definition of a gene (p. 499) ...
Gene: A unit of inheritance, normally comprised of a stretch of DNA that codes for a protein or for an RNA chain (in special cases, it might be carried in RNA form).
This isn't as clear as I would like but maybe it's okay for a general audience. There are lots of units of inheritance that aren't genes—regulatory regions are prime examples—but maybe the average person doesn't need to know this? Unfortunately, the discussion in the main text doesn't stick to the glossary definition. It focuses almost exclusively on protein-coding genes. There are several figures based on a diagram like this ...
Crick was referring to the striking universality of the flow of genetic information throughout biology. From bacteria to elephants—from red-eyed flies to blue-blooded princes—biological information flowed through living systems in a systematic, archetypal manner: DNA provided instructions to build RNA. RNA provided instructions to build proteins. Proteins ultimately enabled structure and function—bringing genes to life.
This is, of course, an incorrect description of Crick's Central Dogma [1] but, more importantly, it's a restricted definition of genes and information. Is this misrepresentation excusable when writing for a general audience? Does the average reader need to know that there's information outside of genes and that some genes don't encode proteins?
Mukherjee doesn't ignore introns. He describes them on pages 219-220 but it's not clear that he considers them to be a part of a gene. What is clear is that he has bought into the idea that introns have a purpose. According to Mukherjee, introns allow shuffling of protein-coding regions to create "a vast number of variant messages—called isoforms—out of a single gene." This is consistent with the message in most of the book. Like many scientists, Mukherjee adores adaptive explanations. You will not find anything in this book that suggests evolution by accident.
This includes the DNA between genes. Mukherjee says it is there to regulate genes (p. 220).
You might be wondering if Mukherjee addresses the human genome and the controversy over junk DNA. The answer is "no." He doesn't give his readers much information on this topic. The relevant chapter [2] is the one beginning on page 322 "The Book of Man (in Twenty-Three Volumes)." [3] It's five pages of bullet points.
Let's look at some of them.
- It has 3,088,286,401 letters of DNA (give or take a few).
The actual size is 3.2 billion base pairs [How Big Is the Human Genome?]. The amount of DNA that has actually been sequenced and organized into scaffolds will depend on the build—the latest ones cover about 92% of the genome [How Much of Our Genome Is Sequenced?].
The exact number of bases isn't important to the average reader but if you are going to include it in your book shouldn't it be the correct value?
- It encodes about 20,687 genes in total—only 1,796 more than worms, 12,000 fewer than corn, and 25,000 fewer than rice or wheat. The difference between "human" and "breakfast cereal" is not a matter of gene numbers, but of sophistication of gene networks. It is not what we have; it is how we use it.
It sounds like Siddhartha Mukherjee might have a mild case of The Deflated Ego Problem. Both humans and breakfast cereal have sophisticated gene networks so that's not really a significant difference. The difference is in how and when genes are expressed but also, in this case, in the types of genes in the genome. Humans and rice plants have thousands of different genes that aren't shared.
There are about 25,000 genes in the human genome. [4] All mammals have about the same number of genes and they all have pretty much the same genes. The differences between whales, bats, elephants, and humans are largely due to differences in when and where developmental genes are expressed during embryogenesis. It's not what all these species have, it's how they use it that makes most of the difference. Humans are not special.
I think it's time to stop being surprised by the fact that some species might have more genes than we do and time to explain why some plants might have more genes. And it's time to stop saying that humans might have a more sophisticated way of controlling their genes. Non-experts [5] might have been surprised by the low number of genes back in 2001 but that was 15 years ago. Get over it.
If your ego has been deflated by the fact that we don't have lots more genes than breakfast cereal, then you'd better come up with an explanation other than the fact that you just don't understand evolution. I listed the seven most common rationalizations. One of them is alternative splicing. Another is "sophisticated" and highly precise gene regulation. Mukherjee goes part way down the path of using some of these rationalizations to explain his disappointment at our low number of genes ....
- It [the human genome] is fiercely inventive. It squeezes complexity out of simplicity. It orchestrates the activation or repression of certain genes in only certain cells and at certain times, creating unique contexts and partners for each gene in time and space, and thus produces near-infinite functional variation out of its limited repertoire. [all multicellular species do this - LAM] And it mixes and matches gene modules—called exons—within single genes to extract even further combinational diversity out of its gene repertoire. These two strategies—gene regulation and gene splicing—appear to be used more extensively in the human genome than in the genomes of most organisms. More than the enormity of gene numbers, the diversity of gene types, or the originality of gene function, it is the ingenuity of our genome that is the secret to our complexity.
This is false. First, we are not significantly more complex than whales, bats, and elephants and not more complex than fruit flies that can fly and can exist in two very different forms: adult and larva.
Second, gene regulation in humans is no different than gene regulation in other species.
Third, alternative splicing exists but it only affects a small number of genes and, for the most part, those genes are also alternatively spliced in all other mammals. The idea that most human genes are alternatively spliced to produce different functional proteins is certainly false. And the idea that only humans can do this is even more false!
What about junk DNA? Here's another bullet point ...
- Genes, oddly, comprise only a minuscule fraction of it. An enormous proportion—a bewildering 98 percent—is not dedicated to genes per se, but to enormous stretches that are interspersed between genes (intergenic DNA) or within genes (introns). These long stretches encode no RNA [introns? - LAM], and no protein; they exist in the genome either because they regulate gene expression, or for reasons that we do not yet understand, or because of no reason whatsoever (i.e. they are "junk" DNA).
Remember that expert scientists have known that most of our genome is junk for over 40 years. Isn't it time we stopped telling the general public that this is "odd" or "bewildering"?
This is the only attempt at explaining junk DNA and the idea that much of our genome could be there for "no reason." I wonder what the average person thinks when they are told, once again, that genes make up only 2% of our genome. I bet they focus on the idea that much of the rest is devoted to regulation and that we just don't understand what else is going on. This is misleading.
It's 2016 and we know a lot about noncoding DNA and a lot about how much of our genome is junk. Isn't it time we explain this to the general public?
Why doesn't Siddhartha Mukherjee do this when he's got the chance?
- Although we fully understand the genetic code—i.e., how the information in a single gene is used to build a protein—we comprehend virtually nothing about the genomic code—i.e., how multiple genes spread across the human genome coordinate gene expression in space and time to build, maintain, and repair a human organism. The genetic code is simple: DNA is used to build RNA, and RNA is used to build a protein. A triplet of bases in DNA specifies one amino acid in a protein. The genomic code is complex: appended to a gene are sequences of DNA that carry information on when and where to express the gene. We do not know why certain genes are located in particular geographical locations in the genome, and how tracts of DNA that lie between genes regulate and coordinate gene physiology. There are codes beyond codes, like mountains beyond mountains. [my emphasis - LAM]
I've been writing textbooks on biochemistry and molecular biology for 30 years and I've been reading textbooks for much longer than that. All those books contain plenty of information on the regulation of gene expression. We know a heck of a lot about transcription factors and DNA binding and we know a heck of a lot about why some genes are expressed in some cells and not in others.
Why would Siddhartha Mukherjee give his readers the impression that this is a big mystery? Do you agree with him?
Three enormous projects lie ahead for human genetics. All three concern discrimination, division, and eventual reconstruction. The first is to discern the exact nature of information in the human genome. The Human Genome Project provided the starting point for this inquiry, but it raised a series of intriguing questions about what, precisely, is "encoded" by the 3 billion nucleotides of human DNA. What are the functional elements in the genome? There are protein-coding genes, of course—about twenty-one to twenty-four thousand in all—but also regulatory sequences of genes, and stretches of DNA (introns) that split genes into modules. There is information to build tens of thousands of RNA molecules that do not get translated into proteins but seem to perform diverse roles in cellular physiology. There are long highways of "junk" DNA that are unlikely to be junk after all and may encode hundreds of yet-unknown functions. There are kinks and folds that allow one part of the chromosome to associate with another in three-dimensional space.
To understand the role of each of these elements, a vast international project, launched in 2013 (sic), hopes to create a compendium of every functional element in the human genome—i.e., any part of any sequence in any chromosome that has a coding or instructional function. Ingeniously termed the Encyclopedia of DNA Elements (ENC-O-DE), this project will cross-annotate the sequence of the human genome against all the information contained within it.
Once these functional "elements" have been identified, biologists can move to the second challenge: understanding how the elements can be combined in time and space to enable human embryology and physiology, the specification of anatomical parts, and the development of an organism's features and characteristics. One humbling fact about our understanding of the human genome is how little we know of the human genome: much of our knowledge of our genes and their functions is inferred from similar-looking genes in yeast, worms, flies, and mice.
I think this is very misleading. Perhaps it's just a case of Mukherjee seeing the glass half empty whereas I see it as half full. He focuses on all the things we don't know whereas I think he is giving short shrift to everything we do know.
Is this all, or is there something else going on? Is it possible that Mukherjee doesn't know enough about genomes and gene regulation to have an informed opinion?
No matter what the reason, the public is being misinformed about the state of knowledge in biochemistry, molecular biology, developmental biology, and genomics. This book is being bought—and presumably read—by a huge number of people. Most reviews are glowing.
Some reviewers have enthusiastically embraced Mukherjee's point of view. For example, here's what Nathaniel Comfort wrote in The Atlantic [Genes Are Overrated].
Ironically, the more we study the genome, the more “the gene” recedes. A genome was initially defined as an organism’s complete set of genes. When I was in college, in the 1980s, humans had 100,000; today, only about 20,000 protein-coding genes are recognized. Those that remain are modular, repurposed, mixed and matched. They overlap and interleave. Some can be read forward or backward. The number of diseases understood to be caused by a single gene is shrinking; most genes’ effects on any given disease are small. Only about 1 percent of our genome encodes proteins. The rest is DNA dark matter. It is still incompletely understood, but some of it involves regulation of the genome itself. Some scientists who study non-protein-coding DNA are even moving away from the gene as a physical thing. They think of it as a “higher-order concept” or a “framework” that shifts with the needs of the cell. The old genome was a linear set of instructions, interspersed with junk; the new genome is a dynamic, three-dimensional body—as the geneticist Barbara McClintock called it, presciently, in 1983, a “sensitive organ of the cell.”
The point is not that this is the correct way to understand the genome. The point is that science is not a march toward truth. Rather, as the author John McPhee wrote in 1967, “science erases what was previously true.” Every generation of scientists mulches under yesterday’s facts to fertilize those of tomorrow.
It's going to take a lot of work to convince the readers of The Atlantic that a lot of "old science" is still valid and there's nothing wrong with the old definition of a gene.
1. The Central Dogma of Molecular Biology.
2. Chapters are not numbered.
3. That would be 22 autosomes, plus one X chromosome plus one Y chromosome = 23!!!
4. We don't have a very good estimate for the total number of genes that specify noncoding RNAs.
5. False History and the Number of Genes 2010 and Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome