More Recent Comments

Showing posts with label Gene Expression.

Friday, December 06, 2013

Do you understand this Nature paper on transcription factor binding in different mouse strains?

I've published a few papers on the regulation of transcription of a mouse gene, and students in my lab have done the standard promoter-bashing experiments to define transcription factor binding sites. I did my Ph.D. in a lab that specialized in DNA binding proteins. I've kept up with the basic ideas in eukaryotic gene expression in order to teach undergraduate courses on that topic and to cover it properly in my textbook.

I've been interested in genome organization for several decades and I've been following the literature on pervasive transcription and transcription factor binding in whole genome studies. I'm reasonably familiar with the techniques although I've never done them myself.

I'm not bragging; I'm just saying that I know a little bit about this stuff so when I saw this paper in one of the latest issues of Nature I decided to look more carefully.
Heinz, S., Romanoski, C., Benner, C., Allison, K., Kaikkonen, M., Orozco, L. and Glass, C. (2013) Effect of natural genetic variation on enhancer selection and function. Nature 503:487-492. [doi: 10.1038/nature12615]

Friday, November 01, 2013

Vertebrate Complexity Is Explained by the Evolution of Long-Range Interactions that Regulate Transcription?

The Deflated Ego Problem is a very serious problem in molecular biology. It refers to the fact that many molecular biologists were puzzled and upset to learn that humans have about the same number of genes as all other multicellular eukaryotes. The "problem" is often introduced by stating that the experts working on the human genome project expected at least 100,000 genes but were "shocked" when the first draft of the human genome showed only 30,000 genes (now down to about 25,000). This story is a myth, as I document in: Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome. Truth is, most knowledgeable experts expected that humans would have about the same number of genes as other animals. They realized that the differences between fruit flies and humans, for example, didn't depend on a host of new human genes but on the timing and expression of a mostly common set of genes.

This isn't good enough for many human chauvinists. They are still looking for something special that sets humans apart from all other animals. I listed seven possibilities in my post on the deflated ego problem:

Friday, September 27, 2013

Dark Matter Is Real, Not Just Noise or Junk

UPDATE: The title is facetious. I don't believe for one second that most so-called "dark matter" has a function. In fact, there's no such thing as "dark matter." Most of our genome is junk. I mention this because one of the well-known junk DNA kooks is severely irony-impaired and thought that I had changed my mind.
A few hours ago I asked you to evaluate the conclusion of a paper by Venters and Pugh (2013) [Transcription Initiation Sites: Do You Think This Is Reasonable?].

Now I want you to look at the Press Release and tell me what you think [see Scientists Discover the Origins of Genomic "Dark Matter"].

It seems pretty clear to me that Pugh (and probably Venters) actually think they are on to something. Here's part of the press release quoting Franklin "Frank" Pugh, a Professor in the Department of Molecular Biology at Penn State.
The remaining 150,000 initiation machines -- those Pugh and Venters did not find right at genes -- remained somewhat mysterious. "These initiation machines that were not associated with genes were clearly active since they were making RNA and aligned with fragments of RNA discovered by other scientists," Pugh said. "In the early days, these fragments of RNA were generally dismissed as irrelevant since they did not code for proteins." Pugh added that it was easy to dismiss these fragments because they lacked a feature called polyadenylation -- a long string of genetic material, adenosine bases -- that protect the RNA from being destroyed. Pugh and Venters further validated their surprising findings by determining that these non-coding initiation machines recognized the same DNA sequences as the ones at coding genes, indicating that they have a specific origin and that their production is regulated, just like it is at coding genes.

"These non-coding RNAs have been called the 'dark matter' of the genome because, just like the dark matter of the universe, they are massive in terms of coverage -- making up over 95 percent of the human genome. However, they are difficult to detect and no one knows exactly what they all are doing or why they are there," Pugh said. "Now at least we know that they are real, and not just 'noise' or 'junk.' Of course, the next step is to answer the question, 'what, in fact, do they do?'"

Pugh added that the implications of this research could represent one step towards solving the problem of "missing heritability" -- a concept that describes how most traits, including many diseases, cannot be accounted for by individual genes and seem to have their origins in regions of the genome that do not code for proteins. "It is difficult to pin down the source of a disease when the mutation maps to a region of the genome with no known function," Pugh said. "However, if such regions produce RNA then we are one step closer to understanding that disease."
I'm puzzled by such statements. It's been one year since the ENCODE publicity fiasco and there have been all kinds of blogs and published papers pointing out the importance of junk DNA and the distinct possibility that most pervasive transcription is, in fact, noise.

It's possible that Pugh and his postdoc are not aware of the controversy. That would be shocking. It's also possible that they are aware of the controversy but decided to ignore it and not reference any of the papers that discuss alternate explanations of their data. That would be even more shocking (and unethical).

Are there any other possibilities that you can think of?

And while we're at it: what excuse can you imagine that lets the editors of Nature off the hook?

P.S. The IDiots at Evolution News & Views (sic) just love this stuff: As We Keep Saying, There's Treasure in "Junk DNA".


Venters, B.J. and Pugh, B.F. (2013) Genomic organization of human transcription initiation complexes. Nature Published online 18 September 2013 [doi: 10.1038/nature12535] [PubMed] [Nature]

The Extraordinary Human Epigenome

We learned a lot about genes and gene expression in the second half of the 20th century. We learned that genes are transcribed and we have a pretty good understanding of how transcription initiation complexes are formed and how transcription works.

We learned how transcription is regulated through promoter strength, activators, and repressors. Activators and repressors bind to DNA and those binding sites can lie at some distance from the promoter leading to formation of loops of DNA that bring the regulatory proteins into contact with the transcription complex. Much of our basic understanding of this process was derived from detailed studies of bacteriophage and bacterial genes.

THEME:
Transcription

Later on we learned that eukaryotic gene expression was very similar and that regulation also required repressors and activators. We discovered that gene expression was associated with chromatin remodeling that opened up regions of the chromosome that were tightly bound to histones in 30 nm or higher-order structures.

Building on studies in prokaryotes, we learned about temporal gene regulation and differentiation. Much of the work was done in model organisms like Drosophila, yeast, C. elegans, and various mammalian cells in culture.

By the end of the century I was pretty confident that what I wrote in my textbook was a fair representation of the fundamental concepts in gene expression and regulation.

Turns out I was wrong as I just discovered this morning when I read the opening paragraph of a review by Rivera and Ren (2013). Here's what they say ...
More than a decade has passed since the human genome was completely sequenced, but how genomic information directs spatial- and temporal-specific gene expression programs remains to be elucidated (Lander, 2011). The answer to this question is not only essential for understanding the mechanisms of human development, but also key to studying the phenotypic variations among human populations and the etiology of many human diseases. However, a major challenge remains: each of the more than 200 different cell types in the human body contains an identical copy of the genome but expresses a distinct set of genes. How does a genome guide a limited set of genes to be expressed at different levels in distinct cell types?
Wow! The textbooks need to be rewritten! We didn't learn anything in the last century!

It took me the whole first paragraph of this paper to realize that the rest of it was probably going to be worthless unless you were interested in technical details about the field. That's because I'm not as smart as Dan Graur. He only read the title, "Mapping Human Epigenomes" and the abstract before concluding that the authors were speaking in newspeak1 [A “Leading Edge Review” Reminds Me of Orwell (and #ENCODE)].

The Rivera and Ren paper is a "Leading Edge" review in the prestigious journal Cell. It covers all the techniques used to study methylation, histone modification and binding, transcription factor binding, and nucleosome positioning at the genome level. According to the authors, people like me were fooled by studies on individual genes, purified factors, and in vitro binding assays. That didn't really tell us what was going on.

Apparently, the most effective way of learning about the regulation of gene expression in humans is to analyze the entire genome all at once and read off the data from microarrays and computer monitors. (After shoving it through a bunch of code.)
Overwhelming evidence now indicates that the epigenome serves to instruct the unique gene expression program in each cell type together with its genome. The word "epigenetics," coined half a century ago by combining "epigenesis" and "genetics," describes the mechanisms of cell fate commitment and lineage specification during animal development (Holliday, 1990; Waddington, 1959). Today, the "epigenome" is generally used to describe the global, comprehensive view of sequence-independent processes that modulate gene expression patterns in a cell and has been liberally applied in reference to the collection of DNA methylation state and covalent modification of histone proteins along the genome (Bernstein et al., 2007; Bonasio et al., 2010). The epigenome can differ from cell type to cell type, and in each cell it regulates gene expression in a number of ways—by organizing the nuclear architecture of the chromosomes, restricting or facilitating transcription factor access to DNA, and preserving a memory of past transcriptional activities. Thus, the epigenome represents a second dimension of the genomic sequence and is pivotal for maintaining cell-type-specific gene expression patterns.

Not long ago, there were many points of trepidation about the value and utility of mapping epigenomes in human cells (Madhani et al., 2008). At the time, it was suggested that histone modifications simply reflect activities of transcription factors (TFs), so cataloging their patterns would offer little new information. However, some investigators believed in the value of epigenome maps and advocated for concerted efforts to produce such resources (Feinberg, 2007; Henikoff et al., 2008; Jones and Martienssen, 2005). The last five years have shown that epigenome maps can greatly facilitate the identification of potential functional sequences and thereby annotation of the human genome. Now, we appreciate the utility of epigenomic maps in the delineation of thousands of lincRNA genes and hundreds of thousands of cis-regulatory elements (ENCODE Project Consortium et al., 2012; Ernst et al., 2011; Guttman et al., 2009; Heintzman et al., 2009; Xie et al., 2013b; Zhu et al., 2013), all of which were obtained without prior knowledge of cell-type-specific master transcriptional regulators. Interestingly, bioinformatic analysis of tissue-specific cis-regulatory elements has actually uncovered novel TFs regulating specific cellular states.
So, what are all these new discoveries that now elucidate what was previously unknown; namely, "how genomic information directs spatial- and temporal-specific gene expression programs"?

This is a very long review full of technical details so let's skip right to the conclusions.
Six decades ago, Watson and Crick put forward a model of DNA double helix structure to elucidate how genetic information is faithfully copied and propagated during cell division (Watson and Crick, 1953). Several years later, Crick famously proposed the "central dogma" to describe how information in the DNA sequence is relayed to other biomolecules such as RNA and proteins to sustain a cell’s biological activities (Crick, 1970). Now, with the human genome completely mapped, we face the daunting task to decipher the information contained in this genetic blueprint. Twelve years ago, when the human genome was first sequenced, only 1.5% of the genome could be annotated as protein coding, whereas the rest of the genome was thought to be mostly "junk" (Lander et al., 2001; Venter et al., 2001). Now, with the help of many epigenome maps, nearly half of the genome is predicted to carry specific biochemical activities and potential regulatory functions (ENCODE Project Consortium, et al., 2012). It is conceivable that in the near future the human genome will be completely annotated, with the catalog of transcription units and their transcriptional regulatory sequences fully mapped.
I hope they hurry up. Not only do I have to re-write my description of the Central Dogma2 but I'm going to have to re-write everything I thought I knew about regulation of gene expression and the organization of information in the human genome. That's going to take time so I hope the epigeneticists will publish lots more whole genome studies in the near future so I can understand the new model of gene expression.

Keep in mind that this paper was published in Cell where it was rigorously reviewed by the leading experts in the field. It must be right.


[Image Credit: Moran, L.A., Horton, H.R., Scrimgeour, K.G., and Perry, M.D. (2012) Principles of Biochemistry 5th ed., Pearson Education Inc. page 647 [Pearson: Principles of Biochemistry 5/E] © 2012 Pearson Education Inc.]

1. Newspeak was first described in 1984 proving, once again, that George Orwell (Eric Arthur Blair) was a really smart and prescient guy. For another example see: What Is "Science" According to George Orwell?.

2. Apparently I didn't read the Crick (1970) paper as carefully as they did.

Rivera, C.M. and Ren, B. (2013) Mapping Human Epigenomes. Cell 155:39-55 [doi: 10.1016/j.cell.2013.09.011]

Transcription Initiation Sites: Do You Think This Is Reasonable?

I'm interested in how scientists read the scientific literature and in how they distinguish good science from bad science. I know that when I read a paper I usually make a pretty quick judgement based on my knowledge of the field and my model of how things work. In other words, I look at the conclusions first to see whether they conflict with or agree with my model.

Many of my colleagues do it differently. They focus on the actual experiments and reach a conclusion based on how they perceive the data. If the experiments look good and the data seem reliable then they tentatively accept the conclusions even if they conflict with the model they have in their mind. They are much more likely to revamp their model than I am.

I'm about to give you the conclusions from a recently published paper in Nature. I'd like to hear from all graduate students, postdocs, and scientists on how you react to those conclusions. Do you think the conclusions are reasonable (as long as the experiments are valid) or do you think that the conclusions are unreasonable, indicating that there has to be something wrong somewhere?

The paper is Venters and Pugh (2013). Its title is "Genomic organization of human transcription initiation complexes." You don't need to read the paper unless you want to get into a more detailed debate. All I want to hear about is your initial reaction to their final two paragraphs.
Consolidated genomic view of initiation

...The discovery that transcription of the human genome is vastly more pervasive than what produces coding mRNA raises the question as to whether Pol II initiates transcription promiscuously through random collisions with chromatin as biological noise or whether it arises specifically from canonical Pol II initiation complexes in a regulated manner. Our discovery of ~150,000 non-coding promoter initiation complexes in human K562 cells and more in other cell lines suggests that pervasive non-coding transcription is promoter-specific, regulated, and not much different from coding transcription, except that it remains nuclear and non-polyadenylated. An important next question is the extent to which transcription factors regulate production of ncRNA.

We detected promoter transcription initiation complexes at 25% of all ~24,000 human coding genes, and found that there were 18-fold more non-coding complexes than coding. We therefore estimate that the human genome potentially contains as many as 500,000 promoter initiation complexes, corresponding to an average of about one every 3 kilobases (kb) in the non-repetitive portion of the human genome. This number may vary more or less depending on what constitutes a meaningful transcription initiation event. The finding that these initiation complexes are largely limited to locations having well-defined core promoters and measured TSSs indicates that they are functional and specific, but it remains to be determined to what end. Their massive numbers would seem to provide an origin for the so-called dark matter RNA of the genome, and could house a substantial portion of the missing heritability.
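For readers who want to check the arithmetic behind that estimate, here is a quick back-of-the-envelope sketch in Python. The gene count, the 25% detection rate, and the 18-fold excess are the figures quoted above; the ~1.5 Gb non-repetitive genome size is my own assumption, chosen only because it makes the quoted one-per-3-kb spacing come out.

```python
# Back-of-the-envelope check of the numbers quoted above (illustrative only).
coding_genes     = 24_000   # approximate number of human protein-coding genes (their figure)
frac_detected    = 0.25     # fraction of coding genes with a detected initiation complex
noncoding_excess = 18       # non-coding complexes per coding complex

coding_complexes   = coding_genes * frac_detected          # ~6,000
noncoding_detected = coding_complexes * noncoding_excess   # ~108,000, same order as their ~150,000

total_estimate   = 500_000   # their extrapolated genome-wide estimate
nonrepetitive_bp = 1.5e9     # assumed size of the non-repetitive genome (my assumption)
spacing_bp       = nonrepetitive_bp / total_estimate       # ~3,000 bp, i.e. one complex every ~3 kb

print(coding_complexes, noncoding_detected, spacing_bp)
```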
Looking forward to hearing from you.

Keep in mind that this is a Nature paper that has been rigorously reviewed by leading experts in the field. Does that influence your opinion?


Venters, B.J. and Pugh, B.F. (2013) Genomic organization of human transcription initiation complexes. Nature Published online 18 September 2013 [doi: 10.1038/nature12535] [PubMed] [Nature]

Thursday, August 29, 2013

Core Misconcept: Epigenetics

Sarah C.P. Williams is a science writer. She published an article in PNAS last February: Epigenetics. Here are the opening paragraphs ...
Despite the fact that every cell in a human body contains the same genetic material, not every cell looks or behaves the same. Long nerve cells stretch out the entire length of an arm or a leg; cells in the retina of the eye can sense light; immune cells patrol the body for invaders to destroy. How does each cell retain its unique properties when, in its DNA-containing nucleus, it has the same master set of genes as every other cell? The answer is in the epigenetic regulation of the genes: the control system that dictates which of many genes a cell uses and which it ignores. The same mechanism could also explain why identical twins—who have identical genes—can develop different diseases, traits, or personalities.

Epigenetic regulation consists of chemical flags, or markers, on genes that are copied along with the genes when the DNA is replicated. Without altering the sequence of DNA’s molecular building blocks, epigenetic changes can alter the way a cell interacts with DNA. These changes can block a cell’s access to a gene, turning it off for good.
Statements like that make me cringe. Not only is she ignoring decades of work on the real explanation of differential gene expression, she is also proposing an explanation that can't possibly live up to the claim she is making.

PNAS should be embarrassed.

Fortunately, I'm not the only one who was upset. Mark Ptashne had the same reaction as several hundred other scientists but he took the time to write up his objections and get them published in the April issue of PNAS [Epigenetics: Core Misconcept]. I'll quote his opening paragraph and then let you follow the link and get educated about real science.
Indeed understanding this problem has been an overarching goal of research in molecular, developmental, and, increasingly, evolutionary biology. And over the past 50 years a compelling answer has emerged from studies in a wide array of organisms. Curiously, the article ignores this body of knowledge, and substitutes for it misguided musings presented as facts.
There was a time when every molecular biology student knew how gene expression was controlled. They knew about the pioneering work in bacteria and 'phage and the exquisite details that were worked out in the '60s, '70s, and '80s. That information has been lost in recent generations. Our current crop of graduate students couldn't tell you how gene expression is controlled in bacteriophage λ.

If you are one of those students then I urge you to read Ptashne's book A Genetic Switch before it goes out of print. If the current trends continue, that information is soon going to pass out of the collective memory of molecular biologists just as it has been forgotten (or never learned) by science writers.


Friday, August 23, 2013

How IDiots Would Activate the GULOP Pseudogene

The enzyme L-gulono-γ-lactone oxidase is required for the synthesis of vitamin C. Humans cannot make this enzyme because the gene encoding it is defective [see Human GULOP Pseudogene]. The GenBank entry for this pseudogene is GeneID=2989. GULOP is located on chromosome 8 at p21.1 in a region that is rich in genes.

Here's a diagram that compares what is left of the human GULOP pseudogene with the functional gene in the rat genome.

Some Questions for IDiots

Here's a short quiz for proponents of Intelligent Design Creationism. Let's see if you have been paying attention to real science. Please try to answer the questions below. Supporters of evolution should refrain from answering for a few days in order to give the creationists a chance to demonstrate their knowledge of biology and of evolution.

The bloggers at Evolution News & Views (sic) are promoting another creationist book [see Biological Information]. This time it's a collection of papers from a gathering of creationists held in 2011. The title of the book, Biological Information: New Perspectives suggests that these creationists have learned something new about biochemistry and molecular biology.

One of the papers is by Jonathan Wells: Not Junk After All: Non-Protein-Coding DNA Carries Extensive Biological Information. Here's part of the opening paragraphs.
James Watson and Francis Crick’s 1953 discovery that DNA consists of two complementary strands suggested a possible copying mechanism for Mendel’s genes [1,2]. In 1958, Crick argued that “the main function of the genetic material” is to control the synthesis of proteins. According to the “Sequence Hypothesis,” Crick wrote that the specificity of a segment of DNA “is expressed solely by the sequence of bases,” and “this sequence is a (simple) code for the amino acid sequence of a particular protein.” Crick further proposed that DNA controls protein synthesis through the intermediary of RNA, arguing that “the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid, is impossible.” Under some circumstances RNA might transfer sequence information to DNA, but the order of causation is normally “DNA makes RNA makes protein.” Crick called this the “Central Dogma” of molecular biology [3], and it is sometimes stated more generally as “DNA makes RNA makes protein makes us.”

The Sequence Hypothesis and the Central Dogma imply that only protein-coding DNA matters to the organism. Yet by 1970 biologists already knew that much of our DNA does not code for proteins. In fact, less than 2% of human DNA is protein-coding. Although some people suggested that non-protein-coding DNA might help to regulate gene expression, the dominant view was that non-protein-coding regions had no function. In 1972, biologist Susumu Ohno published an article wondering why there is “so much ‘junk’ DNA in our genome” [4].
  1. Crick published a Nature paper on The Central Dogma of Molecular Biology in 1970. Did he and most other molecular biologists actually believe that "only protein-coding DNA matters to the organism?"
  2. Did Crick really say that "DNA makes RNA makes protein" is the Central Dogma or did he say that this was the Sequence Hypothesis? Read the paper to get the answer (the link is below).
  3. Is it true that, in 1970, the majority of molecular biologists did not believe in repressor and activator binding sites (regulatory DNA)?
  4. Is it true that in 1970 molecular biologists knew nothing about the functional importance of non-transcribed DNA sequences such as centromeres and origins of DNA replication?
  5. Is it true that most molecular biologists in 1970 had never heard of genes for ribosomal RNAs and tRNAs (non-protein-coding genes)?
  6. If the answer to any of those questions contradicts what Jonathan Wells is saying then why do you suppose he said it?

Crick, F. (1970) Central Dogma of Molecular Biology. Nature 227:561-563. [PDF]

Friday, June 28, 2013

John Mattick on the Importance of Non-coding RNA

John Mattick is a Professor and research scientist at the Garvan Institute of Medical Research at the University of New South Wales (Australia). He received an award from the Human Genome Organization for ....
The Award Reviewing Committee commented that Professor Mattick’s “work on long non-coding RNA has dramatically changed our concept of 95% of our genome”, and that he has been a “true visionary in his field; he has demonstrated an extraordinary degree of perseverance and ingenuity in gradually proving his hypothesis over the course of 18 years.”

Saturday, March 30, 2013

Learning About Evo-Devo

We talked about evolutionary developmental biology (evo-devo) in my class last week. The main issue is whether the proponents of evo-devo are making a substantive contribution to evolutionary theory. Is evo-devo going to be part of an extended modern synthesis, and, if so, how? My own view, which I express to the class, is that the discoveries of developmental biology pretty much confirm what Stephen J. Gould wrote in Ontogeny and Phylogeny back in 1977.
What, then, is at the root of our profound separation? King and Wilson argue convincingly that the decisive differences must involve the evolution of regulation: small changes in the timing of development can have manifold effects upon a final product: "Small differences in the timing of activation or in the level of activity of a single gene could in principle influence considerably the systems controlling embryonic development. The organismal differences between chimpanzees and humans would then result chiefly from genetic changes in a few regulatory systems, while amino acid substitutions in general would rarely be a key factor in major adaptive shifts." Differences in regulation may evolve by point mutations of regulatory genes or by rearrangement of gene order caused by such familiar chromosomal events as inversion, translocation, fusion, and fission. Studies of banding indicate that at least one fusion and ten large inversions and translocations separate chimps and humans.

Stephen J. Gould (1977) Ontogeny and Phylogeny, Harvard University Press, Cambridge Massachusetts, USA pp. 405-406
This helps us understand the history of life, especially the evolution of animals, but it doesn't contribute to evolutionary theory.

PZ Myers is teaching a developmental biology course and his students are dealing with three take-home questions this weekend [What I taught today: O Cruel Taskmaster!]. I'd like to reproduce two of them here since they're very relevant to the debate over the importance of evo-devo.
Question 1: One of the claims of evo devo is that mutations in the regulatory regions of genes are more important in the evolution of form in multicellular organisms than mutations in the coding regions of genes. We’ve discussed examples of both kinds of mutations, but that’s a quantitative claim that won’t be settled by dueling anecdotes. Pretend you’ve been given a huge budget by NSF to test the idea, and design an evodevo research program that would resolve the issue for some specific set of species.
I'd like my students to keep in mind Richard Lenski's ongoing evolution experiment in E. coli. Recall that evolution of the ability to grow on citrate depended mostly on mutations that changed the regulation of citrate utilization genes.

Since we have many examples of mutations that affect regulation of gene expression in bacteria, yeast, and other single-cell organisms, why do the proponents of evo-devo think they're on to something special when they look at development in animals? What is there about the evolution of "form" that changes our views on evolution?
Question 2: Every generation seems to describe the role of genes with a metaphor comparing it to some other technology: it’s a factory for making proteins, or it’s a blueprint, or it’s a recipe. Carroll’s book, Endless Forms Most Beautiful, describes the toolbox genes in terms of “genetic circuitry”, “boolean logic”, “switches and logic gates” — he’s clearly using modern computer technology as his metaphor of choice. Summarize how the genome works using this metaphor, as he does. However, also be aware that it is a metaphor, and no metaphor is perfect: tell me how it might mislead us, too.
Before answering PZ's question about Sean Carroll and metaphors, I'd like my students to remember the quotation I gave them in class. Discuss the use of hyperbole and metaphor in this context.

The key to understanding form is development, the process through which a single-celled egg gives rise to a complex, multi-billion-celled animal. This amazing spectacle stood as one of the great unsolved mysteries of biology for nearly two centuries. And development is intimately connected to evolution because it is through changes in embryos that changes in form arise. Over the past two decades, a new revolution has unfolded in biology. Advances in developmental biology and evolutionary developmental biology (dubbed “Evo Devo”) have revealed a great deal about the invisible genes and some simple rules that shape animal form and function. Much of what we have learned has been so stunning and unexpected that it has profoundly reshaped our picture of how evolution works. Not a single biologist, for example, ever anticipated that the same genes that control the making of an insect’s body and organs also control the making of our bodies.

This book tells the story of this new revolution and its insights into how the animal kingdom has evolved. My goal is to reveal a vivid picture of the process of making animals and how various kinds of changes in that process have molded the different kinds of animals we know today and those from the fossil record.

Sean B. Carroll Endless Forms Most Beautiful: The New Science of Evo Devo, W.W. Norton & Co., New York (2005) p. x
I'd also like Sandwalk readers to keep in mind the recent ENCODE publications. They talked extensively about genetic circuits and regulation. In fact, their major "finding" was the idea that our genome is full of regulatory elements; so many, in fact, that most of what we thought was junk DNA is actually part of a vast control circuit. Has this emphasis on a multitude of switches and controls been misleading or is it turning out to be correct?

I would ask a third question. The evolution of toolkit genes (i.e. transcription factors) makes it possible to evolve many different body plans with only a small number of mutations. It helps explain the Cambrian explosion. Given our current understanding of evolution, is it possible to select for the evolution of a toolkit that has this potential for future evolution? Explain your answer.


Wednesday, January 16, 2013

Why Do the IDiots Have So Much Trouble Understanding Introns?

Most eukaryotic genes have introns. Introns make up about 18% of the DNA sequences in our genome. Most intron sequence is junk, but introns themselves are functional: up to about 80 bp of each intron is required for proper splicing. The essential sequences are the 5′ splice site (~10 bp); the 3′ splice site (~30 bp); the branch site (~10 bp); and enough additional RNA to form a loop (~30 bp). The branch site and the splice sites are where specific proteins bind to the mRNA precursor [Junk in Your Genome: Protein-Encoding Genes]. It turns out that, within introns, about 0.37% of the genome is essential and about 17% is junk.
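To see where numbers like 0.37% come from, here's a minimal back-of-the-envelope sketch. The ~80 bp of essential sequence per intron and the ~18% total intron content are from the paragraph above; the gene and intron counts are round numbers I'm assuming for illustration.

```python
# Rough arithmetic behind the essential-vs-junk split within introns (illustrative).
genome_bp        = 3.2e9    # haploid human genome size
protein_genes    = 20_000   # approximate number of protein-coding genes (assumed round number)
introns_per_gene = 7        # rough genome-wide average (assumed round number)
essential_bp     = 80       # splice sites + branch site + loop, per intron (from the text)
intron_fraction  = 0.18     # introns as a fraction of the genome (from the text)

total_introns      = protein_genes * introns_per_gene          # ~140,000 introns
essential_fraction = total_introns * essential_bp / genome_bp  # ~0.35% of the genome
junk_fraction      = intron_fraction - essential_fraction      # roughly 17% of the genome

print(f"essential: {essential_fraction:.2%}, junk: {junk_fraction:.1%}")
```

The exact figure depends on the gene and intron counts you assume, but the conclusion is robust: only a tiny sliver of intronic sequence is needed for splicing.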


Tuesday, September 11, 2012

ENCODE/Junk DNA Fiasco: John Timmer Gets It Right!

John Timmer is the science editor at Ars Technica. Yesterday he published the best analysis of the ENCODE/junk DNA fiasco that any science writer has published so far [Most of what you read was wrong: how press releases rewrote scientific history].

How did he manage to pull this off? It's not much of a secret. He knew what he was writing about and that gives him an unfair advantage over most other science journalists.

Let me show you what I mean. Here's John Timmer's profile on the Ars Technica website.
John is Ars Technica's science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. John has done over a decade's worth of research in genetics and developmental biology at places like Cornell Medical College and the Memorial Sloan-Kettering Cancer Center. He's been a speaker at the annual meeting of the National Association of Science Writers and the Science Online meetings, and he's one of the organizers of the Science Online NYC discussion series. In addition to being Ars' science content wrangler, John still teaches at Cornell and does freelance writing, editing, and programming.
See what I mean? He has a degree in biochemistry and another one in molecular biology. People like that shouldn't be allowed to write about the ENCODE results because they might embarrass the scientists.

Sunday, September 09, 2012

Ed Yong Updates His Post on the ENCODE Papers

For decades we've known that less than 2% of the human genome consists of exons and that protein-encoding genes represent more than 20% of the genome. (Introns account for the difference between exons and genes.) [What's in Your Genome?]. There are about 20,500 protein-encoding genes in our genome and about 4,000 genes that encode functional RNAs for a total of about 25,000 genes [Humans Have Only 20,500 Protein-Encoding Genes]. That's a little less than the number predicted by knowledgeable scientists over four decades ago [False History and the Number of Genes]. The definition of "gene" is somewhat open-ended but, at the very least, a gene has to have a function [Must a Gene Have a Function?].
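Those two percentages are worth putting side by side. Using the round numbers from the paragraph above (and an assumed 3.2 Gb genome), a typical gene is mostly intron; here is a minimal sketch of the arithmetic.

```python
# Rough per-gene footprint implied by the figures above (illustrative only).
genome_bp  = 3.2e9     # haploid human genome (assumed)
gene_count = 25_000    # protein-coding plus RNA genes (from the text)
exon_frac  = 0.02      # "less than 2%" of the genome is exons
gene_frac  = 0.20      # genes (exons plus introns) occupy roughly 20% of the genome

exon_bp_per_gene = exon_frac * genome_bp / gene_count     # ~2.6 kb of exon per gene
gene_footprint   = gene_frac * genome_bp / gene_count     # ~26 kb per gene overall
intron_share     = 1 - exon_bp_per_gene / gene_footprint  # ~90% of a typical gene is intron

print(f"{exon_bp_per_gene/1e3:.1f} kb exon, {gene_footprint/1e3:.0f} kb per gene, {intron_share:.0%} intron")
```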

We've known about all kinds of noncoding DNA that's functional, including origins of replication, centromeres, genes for functional RNAs, telomeres, and regulatory DNA. Together these functional parts of the genome make up almost 10% of the total. (Most of the DNA giving rise to introns is junk in the sense that it is not serving any function.) The idea that all noncoding DNA is junk is a myth propagated by scientists (and journalists) who don't know their history.

We've known about the genetic load argument since 1968 and we've known about the C-Value "Paradox" and its consequences since the early 1970s. We've known about pseudogenes and we've known that almost 50% of our genome is littered with dead transposons and bits of transposons. We've known that about 3% of our genome consists of highly repetitive DNA that is not transcribed or expressed in any way. Most of this DNA is functional and a lot of it is not included in the sequenced human genome [How Much of Our Genome Is Sequenced?]. All of this evidence indicates that most of our genome is junk. This conclusion is consistent with what we know about evolution and it's consistent with what we know about genome sizes and the C-Value "Paradox." It also helps us understand why there's no correlation between genome size and complexity.

Friday, September 07, 2012

More Expert Opinion on Junk DNA from Scientists

The Nature issue containing the latest ENCODE Consortium papers also has a New & Views article called "Genomics: ENCODE explained" (Ecker et al., 2012). Some of these scientist comment on junk DNA.

For example, here's what Joseph Ecker says,
One of the more remarkable findings described in the consortium's 'entrée' paper is that 80% of the genome contains elements linked to biochemical functions, dispatching the widely held view that the human genome is mostly 'junk DNA'. The authors report that the space between genes is filled with enhancers (regulatory DNA elements), promoters (the sites at which DNA's transcription into RNA is initiated) and numerous previously overlooked regions that encode RNA transcripts that are not translated into proteins but might have regulatory roles.
And here's what Inês Barroso, says,
The vast majority of the human genome does not code for proteins and, until now, did not seem to contain defined gene-regulatory elements. Why evolution would maintain large amounts of 'useless' DNA had remained a mystery, and seemed wasteful. It turns out, however, that there are good reasons to keep this DNA. Results from the ENCODE project show that most of these stretches of DNA harbour regions that bind proteins and RNA molecules, bringing these into positions from which they cooperate with each other to regulate the function and level of expression of protein-coding genes. In addition, it seems that widespread transcription from non-coding DNA potentially acts as a reservoir for the creation of new functional molecules, such as regulatory RNAs.
If this were an undergraduate course I would ask for a show of hands in response to the question, "How many of you thought that there did not seem to be 'defined gene-regulatory elements' in noncoding DNA?"

I would also ask, "How many of you have no idea how evolution could retain 'useless' DNA in our genome?" Undergraduates who don't understand evolution should not graduate in a biological science program. It's too bad we don't have similar restrictions on senior scientists who write News & Views articles for Nature.

Jonathan Pritchard and Yoav Gilad write,
One of the great challenges in evolutionary biology is to understand how differences in DNA sequence between species determine differences in their phenotypes. Evolutionary change may occur both through changes in protein-coding sequences and through sequence changes that alter gene regulation.

There is growing recognition of the importance of this regulatory evolution, on the basis of numerous specific examples as well as on theoretical grounds. It has been argued that potentially adaptive changes to protein-coding sequences may often be prevented by natural selection because, even if they are beneficial in one cell type or tissue, they may be detrimental elsewhere in the organism. By contrast, because gene-regulatory sequences are frequently associated with temporally and spatially specific gene-expression patterns, changes in these regions may modify the function of only certain cell types at specific times, making it more likely that they will confer an evolutionary advantage.

However, until now there has been little information about which genomic regions have regulatory activity. The ENCODE project has provided a first draft of a 'parts list' of these regulatory elements, in a wide range of cell types, and moves us considerably closer to one of the key goals of genomics: understanding the functional roles (if any) of every position in the human genome.
The problem here is the hype. While it's true that the ENCODE project has produced massive amounts of data on transcription factor binding sites etc., it's a bit of an exaggeration to say that "until now there has been little information about which genomic regions have regulatory activity." Twenty-five years ago, my lab published some pretty precise information about the parts of the genome regulating activity of a mouse hsp70 gene. There have been thousands of other papers on the subject of gene regulatory sequences since then. I think we actually have a pretty good understanding of gene regulation in eukaryotes. It's a model that seems to work well for most genes.

The real challenge from the ENCODE Consortium is that they question that understanding. They are proposing that huge amounts of the genome are devoted to fine-tuning the expression of most genes in a vast network of binding sites and small RNAs. That's not the picture we have developed over the past four decades. If true, it would not only mean that a lot less DNA is junk but it would also mean that the regulation of gene expression is fundamentally different than it is in E. coli.



[Image Credit: ScienceDaily: In Massive Genome Analysis ENCODE Data Suggests 'Gene' Redefinition.]

Ecker, J.R., Bickmore, W.A., Barroso, I., Pritchard, J.K., Gilad, Y. and Segal, E. (2012) Genomics: ENCODE explained. Nature 489:52-55. [doi: 10.1038/489052a]

Sunday, January 09, 2011

Splicing Error Rate May Be Close to 1%

Alex Ling alerted me to an important paper in last month's issue of PLoS Genetics. Pickrell et al. (2010) looked at low abundance RNAs in order to determine how many transcripts showed evidence of possible splicing errors. They found a lot of "alternative" spliced transcripts where the new splice junction was not conserved in other species and was used rarely. They attribute this to splicing errors. Their calculation suggests that the splicing apparatus makes a mistake 0.7% of the time.

This has profound implications for the interpretation of alternative splicing data. If Pickrell et al. are correct—and they aren't the only ones to raise this issue—then claims about alternative splicing being a common phenomenon are wrong. At the very least, those claims are controversial and every time you see such a claim in the scientific literature it should be accompanied by a statement about possible artifacts due to splicing errors. If you don't see that mentioned in the paper then you know you aren't dealing with a real scientist.
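To see why a per-intron error rate of 0.7% matters, consider a gene with several introns. The calculation below is a minimal sketch; the eight-introns-per-gene figure is a round number I'm assuming, not a value from the paper.

```python
# If each intron is mis-spliced ~0.7% of the time, independently, then a typical
# multi-intron gene yields a noticeable fraction of aberrant transcripts.
error_rate_per_intron = 0.007   # from Pickrell et al. (2010)
introns_per_gene      = 8       # assumed round number for illustration

p_all_correct = (1 - error_rate_per_intron) ** introns_per_gene
p_any_error   = 1 - p_all_correct
print(f"transcripts with at least one splicing error: ~{p_any_error:.1%}")   # ~5.5%
```

In other words, a few percent of the transcripts from a typical gene could carry rare, unconserved splice junctions without any of them being biologically relevant, which is exactly the class of transcripts that gets reported as "alternative splicing."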

Here's the abstract and the author summary ...
Abstract

While the majority of multiexonic human genes show some evidence of alternative splicing, it is unclear what fraction of observed splice forms is functionally relevant. In this study, we examine the extent of alternative splicing in human cells using deep RNA sequencing and de novo identification of splice junctions. We demonstrate the existence of a large class of low abundance isoforms, encompassing approximately 150,000 previously unannotated splice junctions in our data. Newly-identified splice sites show little evidence of evolutionary conservation, suggesting that the majority are due to erroneous splice site choice. We show that sequence motifs involved in the recognition of exons are enriched in the vicinity of unconserved splice sites. We estimate that the average intron has a splicing error rate of approximately 0.7% and show that introns in highly expressed genes are spliced more accurately, likely due to their shorter length. These results implicate noisy splicing as an important property of genome evolution.

Author Summary

Most human genes are split into pieces, such that the protein-coding parts (exons) are separated in the genome by large tracts of non-coding DNA (introns) that must be transcribed and spliced out to create a functional transcript. Variation in splicing reactions can create multiple transcripts from the same gene, yet the function for many of these alternative transcripts is unknown. In this study, we show that many of these transcripts are due to splicing errors which are not preserved over evolutionary time. We estimate that the error rate in the splicing of an intron is about 0.7% and demonstrate that there are two major types of splicing error: errors in the recognition of exons and errors in the precise choice of splice site. These results raise the possibility that variation in levels of alternative splicing across species may in part be due to variation in splicing error rate.


Pickrell, J.K., Pai, A.A., Gilad, Y. and Pritchard, J.K. (2010) Noisy Splicing Drives mRNA Isoform Diversity in Human Cells. PLoS Genet. 6(12): e1001236. [doi: 10.1371/journal.pgen.1001236]

Thursday, May 20, 2010

Junk RNA or Imaginary RNA?

RNA is very popular these days. It seems as though new varieties of RNA are being discovered just about every month. There have been breathless reports claiming that almost all of our genome is transcribed and that most of this RNA has to be functional even though we don't yet know what the function is. The fervor with which some people advocate a paradigm shift in thinking about RNA approaches that of a cult follower [see Greg Laden Gets Suckered by John Mattick].

We've known for decades that there are many types of RNA besides messenger RNA (mRNA), which encodes proteins. Besides the standard ribosomal RNAs and transfer RNAs (tRNAs), there are a variety of small RNAs required for splicing and many other functions. There's no doubt that some of the new discoveries are important as well. This is especially true of small regulatory RNAs.

However, the idea that a huge proportion of our genome could be devoted to synthesizing functional RNAs does not fit with the data showing that most of our genome is junk [see Shoddy But Not "Junk"?]. That hasn't stopped RNA cultists from promoting experiments leading to the conclusion that almost all of our genome is transcribed.

That may change. A paper just published in PLoS Biology shows that the earlier work was prone to artifacts. Some of those RNAs may not even be there and others are present in tiny amounts.

Late to the Party

Several people have already written about this paper, including Carl Zimmer and PZ Myers. There are also summaries in Nature News and PLoS Biology.

The work was done by Harm van Bakel in Tim Hughes' lab, right here in Toronto. It's only a few floors, and a bridge, from where I'm sitting right now. The title of their paper tries to put a positive spin on the results: "Most 'Dark Matter' Transcripts Are Associated With Known Genes" [van Bakel et al. (2010)]. Nobody's buying that spin. They all recognize that the important result is not that non-coding RNAs are mostly associated with genes but the fact that they are not found in the rest of the genome. In other words, most of our genome is not transcribed in spite of what was said in earlier papers.

Van Bakel compared two different types of analysis. The first, called "tiling arrays," is a technique where bulk RNA (cDNA, actually) is hybridized to a series of probes on a microchip. The probes are short pieces of DNA corresponding to genomic sequences spaced every few thousand base pairs along each chromosome. When some RNA fragment hybridizes to one of these probes you score that as a "hit." The earlier experiments used this technique and the results indicated that almost every probe could hybridize an RNA fragment. Thus, as you scanned the chip you saw that almost every spot recorded a "hit." The conclusion is that almost all of the genome is transcribed even though only 2% corresponds to known genes.

The second type of analysis is called RNA-Seq and it relies on direct sequencing of RNA fragments. Basically, you copy the RNA into DNA, selecting for small 200 bp fragments. Using new sequencing technology, you then determine the sequence of one (single end) or both ends (paired end) of this cDNA. You may only get 30 bp of good sequence information but that's sufficient to place the transcript on the known genome sequence. By collecting millions of sequence reads, you can determine what parts of the genome are transcribed and you can also determine the frequency of transcription. The technique is much more quantitative than tiling experiments.
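The counting step at the heart of RNA-Seq analysis is conceptually simple: map each read, then tally reads per genomic interval. Here is a minimal sketch of that idea, assuming you already have mapped read start positions for one chromosome; the bin size and read threshold are arbitrary illustrative choices, not the values used by van Bakel et al.

```python
from collections import Counter

def transcribed_bins(read_starts, bin_size=1_000, min_reads=5):
    """Tally mapped read start positions into fixed-size bins and keep the bins with
    enough reads to call them 'transcribed'. Purely illustrative, not their pipeline."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return {bin_id * bin_size: n for bin_id, n in counts.items() if n >= min_reads}

# Toy example: most reads cluster in one 1-kb region; stray reads elsewhere are ignored.
reads = [10_250, 10_300, 10_310, 10_475, 10_820, 250_000, 980_100]
print(transcribed_bins(reads))   # {10000: 5}
```

Because every read is counted, the same data tell you both where transcription occurs and roughly how often, which is why RNA-Seq is more quantitative than scoring hybridization "hits" on a tiling array.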

Van Bakel et al. show that using RNA-Seq they detect very little transcription from the regions between genes. On the other hand, using tiling arrays they detect much more transcription from these regions. They conclude that the tiling arrays are producing spurious results—possibly due to cross-hybridization or possibly due to detection of very low abundance transcripts. In other words, the conclusion that most of our genome is transcribed may be an artifact of the method.

The parts of the genome that are presumed to be transcribed but have no known function are called "dark matter." Here's the important finding in the authors' own words.
To investigate the extent and nature of transcriptional dark matter, we have analyzed a diverse set of human and mouse tissues and cell lines using tiling microarrays and RNA-Seq. A meta-analysis of single- and paired-end read RNA-Seq data reveals that the proportion of transcripts originating from intergenic and intronic regions is much lower than identified by whole-genome tiling arrays, which appear to suffer from high false-positive rates for transcripts expressed at low levels.
Many of us dismissed the earlier results as transcriptional noise or "junk RNA." We thought that much of the genome could be transcribed at a very low level but this was mostly due to accidental transcription from spurious promoters. This low level of "accidental" transcription is perfectly consistent with what we know about RNA polymerase and DNA binding proteins [What is a gene, post-ENCODE?, How RNA Polymerase Binds to DNA]. Although we might have suspected that some of the "transcription" was a true artifact, it was difficult to see how the papers could have failed to consider such a possibility. They had been through peer review and the reviewers seemed to be satisfied with the data and the interpretation.

That's gonna change. I suspect that from now on everybody is going to ignore the tiling array experiments and pretend they don't exist. Not only that, but in light of recent results, I suspect more and more scientists will announce that they never believed the earlier results in the first place. Too bad they never said that in print.


van Bakel, H., Nislow, C., Blencowe, B. and Hughes, T. (2010) Most "Dark Matter" Transcripts Are Associated With Known Genes. PLoS Biology 8: e1000371 [doi:10.1371/journal.pbio.1000371]

Thursday, May 06, 2010

I Don't Have Time for This!

 
The banner headline on the front page of The Toronto Star says, "U of T cracks the code." You can read the newspaper article on their website: U of T team decodes secret messages of our genes. ("U of T" refers to the University of Toronto - our newspaper thinks we're the only "T" university in the entire world.)

The hyperbole is beyond disgusting.

The work comes from labs run by Brendan Frey and Ben Blencowe and it claims to have discovered the "splicing code" mediating alternative splicing (Barash et al., 2010). You'll have to read the paper yourself to see if the headlines are justified. It's clear that Nature thought it was important 'cause they hyped it on the front cover of this week's issue.

The frequency of alternative splicing is a genuine scientific controversy. We've known for 30 years that some genes are alternatively spliced to produce different protein products. The controversy is over what percentage of genes have genuine biologically relevant alternative splice variants and what percentage simply exhibit low levels of inappropriate splicing errors.

Personally, I think most of the predicted splice variants are impossible. The data must be detecting splicing errors [Two Examples of "Alternative Splicing"]. I'd be surprised if more than 5% of human genes are alternatively spliced in a biologically relevant manner.

Barash et al. (2010) disagree. They begin their paper with the common mantra of the true believers.
Transcripts from approximately 95% of multi-exon human genes are spliced in more than one way, and in most cases the resulting transcripts are variably expressed between different cell and tissue types. This process of alternative splicing shapes how genetic information controls numerous critical cellular processes, and it is estimated that 15% to 50% of human disease mutations affect splice site selection.
I don't object to scientists who hold points of view that are different than mine—even if they're wrong! What I object to is those scientists who promote their personal opinions in scientific papers without even acknowledging that there's a genuine scientific controversy. You have to look very carefully in this paper for any mention of the idea that a lot of alternative splicing could simply be due to mistakes in the splicing machinery. And if that's true, then the "splicing code" that they've "deciphered" is just a way of detecting when the machinery will make a mistake.

We've come to expect that science writers can be taken in by scientists who exaggerate the importance of their own work, so I'm not blaming the journalists at The Toronto Star and I'm not even blaming the person who wrote the University of Toronto press release [U of T researchers crack 'splicing code']. I'll even forgive the writers at Nature for failing to be skeptical [The code within the code] [Gene regulation: Breaking the second genetic code].

It's scientists who have to accept the blame for the way science is presented to the general public.
Frey compared his computer decoder to the German Enigma encryption device, which helped the Allies defeat the Nazis after it fell into their hands.

“Just like in the old cryptographic systems in World War II, you’d have the Enigma machine…which would take an instruction and encode it in a complicated set of symbols,” he said.

“Well, biology works the same way. It turns out to control genetic messaging it makes use of a complicated set of symbols that are hidden in DNA.”
Given the number of biological activities needed to grow and govern our bodies, scientists had believed humans must have 100,000 genes or more to direct those myriad functions.

But that genomic search of the 3 billion base pairs that make up the rungs of our twisting DNA ladders revealed a meagre 20,000 genes, about the same number as the lowly nematode worm boasts.

“The nematode has about 1,000 cells, and we have at least 1,000 different neuron (cells) in our brains alone,” said Benjamin Blencowe, a U of T biochemist and the study’s co-senior author.

To achieve this huge complexity, our genes must be monumental multi-taskers, with each one having the potential to do dozens or even hundreds of different things in different parts of the body.

And to be such adroit role switchers, each gene must have an immensely complex set of instructions – or a code – to tell them what to do in any of the different tissues they need to perform in.
I wish I had time to present a good review of the paper but I don't. Sorry.


Barash, Y., Calarco, J.A., Gao, W., Pan, Q., Wang, X., Shai, O., Blencowe, B.J. and Frey, B.J. (2010) Deciphering the splicing code. Nature 465:53-59. [doi: 10.1038/nature09000] [Supplementary Information]

Monday, October 19, 2009

What's the Connection between Hpa II and CpG Islands?

 
Epigenetics is all the rage today but the idea that gene expression could be regulated by modifying DNA and/or chromatin has been around for three decades.

Methylation is one of the ways that DNA can be modified and methylation at specific sites can be heritable. This observation grew out of studies on restriction/modification systems where DNA is protected from the action of restriction endonucleases by methylating the bases.

I didn't realize that the study of restriction enzymes led to the discovery of methylated regions of eukaryotic DNA. Find out how by reading an interview with Adrian Bird in PLoS Genetics: On the Track of DNA Methylation: An Interview with Adrian Bird.

This is also a good example of chance and serendipity in science. You can't plan for this stuff to happen—but that doesn't prevent politicians and administrators from trying.


Wednesday, October 07, 2009

The Ribosome and the Central Dogma of Molecular Biology

The Nobel Prize website usually does an excellent job of explaining the science behind the prizes. The STRUCTURE AND FUNCTION OF THE RIBOSOME is a good explanation of why the 2009 Nobel Prize in Chemistry was awarded for work on the ribosome.

Unfortunately, the article begins by perpetuating a basic misunderstanding of the Central Dogma of Molecular Biology.
The ribosome and the central dogma. The genetic information in living systems is stored in the genome sequences of their DNA (deoxyribonucleic acid). A large part of these sequences encode proteins which carry out most of the functional tasks in all extant organisms. The DNA information is made available by transcription of the genes to mRNAs (messenger ribonucleic acids) that subsequently are translated into the various amino acid sequences of all the proteins of an organism. This is the central dogma (Crick, 1970) of molecular biology in its simplest form (Figure 1)

This is not the Central Dogma according to Crick (1970). I explain this in a posting from two years ago [Basic Concepts: The Central Dogma of Molecular Biology].

In both his original paper (Crick, 1958) and the 1970 update, Crick made it very clear that the Central Dogma of Molecular Biology is ....
The Central Dogma. This states that once “information” has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.
The diagram that's usually attributed to the Central Dogma is actually the Sequence Hypothesis. Crick was well aware of the confusion and that's why he wrote the 1970 paper. It came at a time when the so-called "Central Dogma" had just been "overthrown" by the discovery of reverse transcriptase.

Since then the false version of the Central Dogma has been disproven dozens and dozens of times—it's a minor cottage industry.

Here's what Crick says about this false version of the Central Dogma in his 1970 paper—the one quoted at the top of this page.
It is not the same, as is commonly assumed, as the sequence hypothesis, which was clearly distinguished from it in the same article (Crick, 1958). In particular, the sequence hypothesis was a positive statement, saying that the (overall) transfer nucleic acid → protein did exist, whereas the central dogma was a negative statement saying that transfers from protein did not exist.
Let's try to get it right. That way we won't have to put up with any more papers claiming to refute the Central Dogma of Molecular Biology!

It will also encourage critical thinking. Haven't you ever wondered why there is a Central Dogma at all when reverse transcriptase, splicing, epigenetics, post-translational modification, chromatin rearrangements, small regulatory RNAs, and just about everything else under the sun supposedly refute it?
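To make the point concrete, here's a trivial sketch (my own framing, not anything taken from Crick's papers) of which information transfers the Central Dogma actually forbids. Only transfers out of protein count; reverse transcription and the rest are perfectly compatible with it.

```python
# A minimal sketch, not from Crick: the Central Dogma only forbids
# sequence-information transfers *out of* protein. Reverse transcription
# (RNA -> DNA), RNA replication, etc. are allowed and do not refute it.

FORBIDDEN = {("protein", "protein"), ("protein", "RNA"), ("protein", "DNA")}

def violates_central_dogma(source, destination):
    """True only if sequence information would flow out of protein."""
    return (source, destination) in FORBIDDEN

if __name__ == "__main__":
    print(violates_central_dogma("RNA", "DNA"))      # False: reverse transcriptase is fine
    print(violates_central_dogma("protein", "RNA"))  # True: this would be a genuine refutation
```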


Crick, F.H.C. (1958) On protein synthesis. Symp. Soc. Exp. Biol. XII:138-163.

Crick, F. (1970) Central Dogma of Molecular Biology. Nature 227:561-563. [PDF file]

Sunday, July 19, 2009

The Origin of Dachshunds

 
A draft sequence of the dog (Canis lupus familiaris) genome has been available for several years. One of the reasons for working with dog genes and genomes is the fact that there are many different breeds. Since these breeds differ genetically and morphologically, there's a distinct possibility that the genes for various characteristics can be identified by comparing variants from different breeds.

One of the exciting possibilities is that some interesting behavioral genes could be identified since many breeds of dog are loyal, easy to train, and intelligent.1

In addition to possible behavioral genes, one can identify many genes affecting morphology. One of them is the gene affecting short legs in various breeds, including dachshunds. Parker et al. (2009) identified an extra gene in short-legged breeds. The extra gene is a retrogene of the normal gene encoding fibroblast growth factor 4 (fgf4).

What is a retrogene? It's a DNA copy derived from the mature mRNA of a normal gene. Recall that most mammalian genes have introns: the primary transcript contains extra sequences at both ends, exons that encode the amino acid sequence of the protein, and intron sequences that separate the exons.

This primary transcript is processed to produce the mature messenger RNA (mRNA) that is subsequently translated by the translation machinery in the cytoplasm. During processing, the intron sequences are spliced out, a 5′ cap is added to the beginning of the RNA, and a string of "A" residues is added to the 3′ end (the poly(A) tail).
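As a rough illustration of that processing, here's a toy sketch (made-up sequence and exon coordinates, nothing to do with the real fgf4 transcript): splice out the intron, tack on a cap, and add a poly(A) tail.

```python
# A minimal sketch of mRNA processing, assuming made-up exon coordinates.
# Introns are removed, the 5' cap is represented as "m7G-", and a poly(A)
# tail is appended.

def mature_mrna(primary_transcript, exons, poly_a_length=20):
    """exons: ordered list of (start, end) 0-based half-open intervals."""
    spliced = "".join(primary_transcript[s:e] for s, e in exons)
    return "m7G-" + spliced + "A" * poly_a_length

if __name__ == "__main__":
    primary = "AAAGGGCCCTTTAAAGGGCCC"   # toy primary transcript
    exons = [(0, 6), (12, 21)]          # the intron spans positions 6-12
    print(mature_mrna(primary, exons, poly_a_length=10))
```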


On rare occasions the mature mRNA can be accidentally copied by an enzyme called reverse transcriptase, which converts RNA into single-stranded DNA. (This is the reverse of transcription, which copies DNA into RNA.) The single-stranded DNA can then be copied by DNA polymerase to make a double-stranded DNA version of the original mRNA sequence.
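Here's the same idea in code: a minimal sketch (toy mRNA, standard base-pairing rules) of the two copying steps, reverse transcription into single-stranded cDNA followed by second-strand synthesis.

```python
# A minimal sketch of reverse transcription and second-strand synthesis,
# using a made-up mRNA and standard base-pairing rules.

RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_transcribe(mrna):
    """Reverse transcriptase: mRNA -> complementary single-stranded cDNA."""
    return "".join(RNA_TO_DNA[b] for b in reversed(mrna))

def second_strand(cdna):
    """DNA polymerase: make the strand complementary to the cDNA."""
    return "".join(DNA_PAIR[b] for b in reversed(cdna))

if __name__ == "__main__":
    mrna = "AUGGCCUUUAAAGGG"             # toy mRNA
    cdna = reverse_transcribe(mrna)
    print(cdna)                          # first (cDNA) strand
    print(second_strand(cdna))           # second strand: the mRNA sequence in DNA letters
```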

This piece of DNA may get integrated back into the genome by recombination. This is an extremely rare event but over the course of millions of years the genome accumulates many copies of such DNA sequences. In the vast majority of cases the DNA sequence is not expressed because it has been separated from its normal promoter. (Sequences that regulate transcription are usually not present in the primary transcript.) These DNA segments are called pseudogenes because they are not functional. They accumulate mutations at random and the sequence diverges from the sequence of the normal gene from which they were derived.

Sometimes the DNA copy of the mRNA happens to insert near a functional promoter and the DNA is transcribed. In that case the gene may be expressed and additional protein is made. Note that the new retrogene doesn't have introns, so the primary transcript doesn't require splicing in order to join the coding regions (exons). The fgf4 retrogene inserted into the middle of a LINE transposable element, and the LINE promoter probably drives transcription of the retrogene.

The short-legged phenotype is probably due to inappropriate expression of the retrogene in the embryo in tissues that generate the long bones of the legs. The inappropriate expression of fibroblast growth factor 4 causes early calcification of cells in the growth plates—these are the cells that regulate extension of the growing bones. The result is short bones that are often curved.

Breeders selected for this anomaly, and that selection is part of what contributed to the origin of dachshunds and other short-legged dogs.

There's a reason why dogs are such a good species for discovering the functions of many genes. It's because of the huge variety of different breeds. Is there a reason why this species has more morphological variation than other animal species? Probably, but we don't know what it is. Here's how Parker et al. begin their paper.
The domestic dog is arguably the most morphologically diverse species of mammal and theories abound regarding the source of its extreme variation (1). Two such theories rely on the structure and instability of the canine genome, either in an excess of rapidly mutating microsatellites (2) or an abundance of overactive SINEs (3), to create increased variability from which to select for new traits. Another theory suggests that domestication has allowed for the buildup of mildly deleterious mutations that, when combined, create the variation observed in the domestic dog (4).
We still have a lot to learn about evolution.


[Photo Credit: Dog Gone Good]

1. You can see why working with the cat genome wouldn't be as productive.

Parker, H.G., Vonholdt, B.M., Quignon, P., Margulies, E.H., Shao, S., Mosher, D.S., Spady, T.C., Elkahloun, A., Cargill, M., Jones, P.G., Maslen, C.L., Acland, G.M., Sutter, N.B., Kuroki, K., Bustamante, C.D., Wayne, R.K., and Ostrander, E.A. (2009) An Expressed Fgf4 Retrogene Is Associated with Breed-Defining Chondrodysplasia in Domestic Dogs. Science, published online July 16, 2009 [Epub ahead of print]. [PubMed] [doi: 10.1126/science.1173275]