Friday, February 29, 2008
The Ladies' Privilege
Friday's Urban Legend: TRUE
According to legend, girls can propose to their boyfriends on February 29th. Snopes.com says this is a true custom known as The Ladies Privilege [The Privilege of Ladies]. Back in the thirteenth century a man had to accept a proposal on February 29th or pay a fine. This probably explains why so many men were off fighting wars at the end of February during a leap year.
It's based on the idea that February 29th is an unusual day and unusual things are permitted on that day only. One of the unusual things that is allowed is for women to propose marriage. Nowadays this is much more common on all the other days but apparently there was a time when only the man could propose marriage. I wonder what those times were like when men were in control?
Wednesday, February 27, 2008
The State of Science Blogging
There's an interesting discussion going on over at Bayblab on The State of Science Blogging. The commenters are responding to some provocation by Anonymous Coward[1] who took a look at the top five science blogs and said,
Of those only Cognitive daily is consistantly talking about peer-reviewed research. Why is that? Perhaps there is less appeal in discussing recent papers than bashing creationists. But bashing creationists is almost too easy, and not very constructive. It's been said before, you can't reason somebody out of a position in which they didn't reason themselves into. And it worries me because to the lay audience listening to PZ Myers (the 800lb gorilla), it would seem that science's purpose is to attack religion. In fact I suspect the blog gets most of its traffic from creationists. According to technorati, his top tags are "Creationism, godlessness, humor, kooks, politics, religion, weblog, weirdness", so should it really count as a science blog?
If you examine the elephant in the room, ScienceBlogs, the trend is maintained: politics, religion books, technology, education and music are tagged more often than biology or genetics. This suggests that their primary motives are entertainment rather than discussing science. Why? Because it pays. Seed Magazine and the bloggers themselves profit from the traffic. That's right, Seed actually pays these bloggers for their posts. And the whole ScienceBlogs thing is a little incestuous, they really like linking to each other, but not so much to the little blogs. I'm afraid gone is the amateur blogger, and in is the professional gonzo science journalist. Might as well read Seed magazine.
One of the most interesting comments comes from Dave Munger of Cognitive Daily:
So... the most popular science blogs cover the most popular topics related to science?
You also seem to be saying that you wish these bloggers would write about less popular topics. But that would make them... less popular. And then other science blogs would become the most popular. Then you could complain about those blogs.
At least you'd have something to write about.
This is a very important point. Many of us are interested in blogging about science and in teaching science. But you can't be an effective advocate for science if you don't have an audience. One way to get an audience is to blog about science-related issues that are controversial and then sneak in some good science blogs when people come to visit.
In my case, that's not the only motive for blogging about rationalism and superstition. I happen to have (at least) two interests in life and I like to blog about everything that interests me. As it turns out, there are more people interested in the conflict between science and religion—or the war in the Middle East—than in hard-core science. I posted a whole series of articles on The Three Domain Hypothesis and got only a handful of comments. The series on junk DNA is bringing in just a trickle of interested readers. On the other hand, when I post about religion or politics there are dozens of comments and a lively discussion ensues.
1. I don't like linking to anonymous bloggers. In the future I'm going to make it a policy to only link to bloggers who identify themselves, except under rare circumstances.
Thursday, February 21, 2008
Tangled Bank #99
The latest issue of Tangled Bank is #99. It's hosted by Greg Laden at Greg Laden's Blog [The Tangled Bank].
This is the February 20, 2008 edition of The Tangled Bank web carnival. The next edition will be hosted at Archaeoporn.
If you want to submit an article to Tangled Bank send an email message to host@tangledbank.net. Be sure to include the words "Tangled Bank" in the subject line. Remember that this carnival only accepts one submission per week from each blogger. For some of you that's going to be a serious problem. You have to pick your best article on biology.
In the Cafeteria
Yesterday we were at the Musée d'Orsay. We sat down to have a coffee in the museum cafeteria. As you can see, some cafeterias in France are a little more fancy than the average museum cafeteria in North America.
My wife's weird sense of humor produced this picture of me admiring the statues.
Les Invalides
This is my fourth visit to Paris but it's the first time I've been to Les Invalides. The tomb of Napoleon Bonaparte is much more impressive than I ever expected.
Are Brussels Sprouts Bad for You?
Probably not, and that's a good thing because I like Brussels sprouts. They may be OK for humans but they're bad for aphids [Eat up all of your Brussels sprouts -- unless you're an aphid].
I'm going to be in Brussels today. I'm looking forward to a nice meal of Brussels sprouts with beer and chocolate.
Wednesday, February 20, 2008
An IDiot Software Developer Opines About Junk DNA
Randy "I want to believe" Stimpson is a software developer who thinks he understands biology. He has written a post where he claims Most DNA is not Junk. Doppelganger has already pointed out the most obvious faults with Randy's point of view [Software developer PROVES that there is no junkDNA*... and other stuff]. I just want to comment on one small paragraph in order to clear up any confusion.
A bacterial genome has 4 million base pairs of DNA and according to Professor Larry Morgan, a bacterial genome doesn’t have junk. So I think it is safe to say that there is at least 1MB of information in the human genome.
I'm pretty sure he's referring to me. I'd like to point out for the record that bacterial genomes range in size from about 10⁶ bp up to 10⁷ bp.
All bacterial genomes have junk DNA consisting mostly of defective transposons and defective prophage. In most cases the amount of junk DNA is only a few percent of the genome.
The views expressed by Randy Stimpson are typical of those who desperately want to believe in intelligent design creationism. Junk DNA is not compatible with intelligent design creationism no matter how you cut it.
La Tour Eiffel
These are my pictures of the Eiffel Tower.
It's much easier to take pictures like this from the second level 'cause you don't have to hang out near the outer railing where the risk of falling off is very high. I found that it's much better to stay far away from the edge. My knees were much more stable when I did that.
Rue du Cherche Midi
This is a picture of Rue du Cherche Midi right outside our apartment in Paris. The street is full of nice shops (expensive), bakeries, and small restaurants. It's perfectly situated in the middle of the 6th (6e) arrondissement [Map].
There are lots of interesting places to see right in our neighborhood. One of the nearby cafés is shown below along with a close-up of a plaque hanging on the wall of a building on the next street. It says that John Paul Jones died in that building in July 1792. (It seems as though the prominent men and women of the Revolutionary War were very fond of France.) Incidentally, I took a quick poll of several people in the vicinity and none of them knew who John Paul Jones was.
Monday, February 18, 2008
Gene Genie #25
The 25th edition of Gene Genie has been posted at Gene Sherpas [Gene Genie is Back at The Sherpa!].
There are many posts that were submitted. I have to say, we are doing a good job of covering these genes, but probably won't get through them all. I am excited about a ton of this content. But when we move through genetic discovery, talk always falls back to personalized medicine.
The beautiful logo was created by Ricardo at My Biotech Life.
The purpose of this carnival is to highlight the genetics of one particular species, Homo sapiens.
Sunday, February 17, 2008
How Matt Nisbet Conned AAAS
Some of you might recall an earlier posting where I criticized Matt Nisbet for the way he organized a panel at the AAAS meeting without allowing anyone to give the other side of the issue [AAAS Panel: Communicating Science in a Religious America]. I sent an email message to Professor Goldston, the panel moderator. Here's part of what I said.
I don't object to Nisbet presenting his point of view at a AAAS meeting but my respect for AAAS and your panel would be greatly diminished if the other side did not get a chance to make its case. Surely you do not want to give the impression that AAAS will only support scientists who agree with Nisbet? Surely you do not want to have a panel where the so-called "New Atheist" perspective is excluded and only religious scientists, or their close allies, are allowed to speak? Is that fair?
Please make sure that you have appropriate balance on your panel. Please make sure you don't give the impression that AAAS endorses Nisbet and his ideas about framing. The other side needs to be heard.
Mike Dunford has followed up with a posting from several days ago [Yeah, could have seen that one coming].
It's about time we realized that Matt Nisbet is not a friend of science. He needs to be strongly opposed before he succeeds in fooling any more naive scientists who might fall for his silly nonsense.
This "framing" thing has gone too far.
Wednesday, February 13, 2008
Nobel Laureate: André Lwoff
The Nobel Prize in Physiology or Medicine 1965.
"for their discoveries concerning genetic control of enzyme and virus synthesis"
André Lwoff (1902 - 1994) received the Nobel Prize in Physiology or Medicine for his work on gene expression in bacteriophage λ. He shared the prize with François Jacob and Jacques Monod. The three men worked together at the Institut Pasteur in Paris, France, at a time when it was one of the leading centers of research in this field.
Jacob and Monod were recognized for their pioneering work on The lac Operon. Lwoff worked on the regulation of gene expression in λ. He was responsible for discovering that bacteriophage λ could enter a dormant (lysogenic) state by integrating into the E. coli genome and repressing transcription of all the genes required in the lytic stage of development.
THEME:
Nobel Laureates
The presentation speech was given by Professor Sven Gard, member of the Nobel Committee for Physiology or Medicine of the Royal Caroline Institute.
Your Majesties, Royal Highnesses, Ladies and Gentlemen.
The 1965 Nobel Prize in Physiology or Medicine is shared by Professors Jacob, Lwoff and Monod for «discoveries concerning the genetic regulation of enzyme and virus synthesis».
This particular sphere of research is by no means easy. I heard one of the prize winners, Professor Jacob, forewarn an audience of specialists more or less as follows: «In describing genetic mechanisms, there is a choice between being inexact and incomprehensible». In making this presentation, I shall try to be as inexact as conscience permits.
It has become progressively more apparent that the answer to what has hitherto been romantically termed the secret of life must be sought in the mechanism of action and in the structure of the hereditary material, the genes. This central field of research has naturally been approached from the periphery and in stages. Only in recent years has it been possible to make a serious attack on these fundamental problems.
Several previous Nobel Prize holders: Beadle, Tatum, Crick, Watson, Wilkins, Kornberg and Ochoa have worked in this sphere of research and have formulated certain basic proposals which have enabled the French scholars to continue their efforts. It has been established that one of the principal functions of genes must be to determine the nature and number of enzymes within the cell, the chemical apparatus which controls all the reactions by which the cellular material is formed and the energy necessary for various life processes is released. There is thus a particular gene for each specific enzyme.
In addition, some light has been thrown on the chemical structure of genes. In principle, they have the form of a long double chain consisting of four different components, which can be designated by the letters a, c, g, and t, and with the property of forming pairs with each other. An «a» in one of the chains has to be matched by a «t» in the other, a «g» only by a «c». However, they can be linked along the length of the chain in any order whatsoever, so that the number of possible combinations is virtually unlimited. A chain of genes contains from several hundreds to many thousands of units; such structures can easily carry the specific patterns for the million or more genes which it is estimated that a cell may have.
This model of the genes represents a coded message containing two types of information. If the double chain of a gene is split lengthwise and each half acquires a new partner, then the final result is two double chains identical to the original gene. The model thus contains information relative to the actual structure of the gene, which permits multiplication, in its turn a condition of heredity. When a cell divides, each daughter cell receives an exact copy of the parent gene. The structure of the double chain ensures the stability and permanence required by hereditary material.
But the model can also be read in another way. Along the length of the chain, the letters are grouped in threes in coded words. An alphabet of four letters allows the formation of more than 30 different words and the sequence in the gene of such words provides the structural information for an enzyme or some other protein. Proteins are also chain molecules built up from twenty or so different types of building blocks. To each of these building blocks there corresponds a chemical code word of three letters. The gene thus contains information on the number, nature, and order of the building blocks in a particular protein.
Thus it was already clear that the hereditary blueprint contained the collective structural information for all substances necessary for the functions of the living cell. It was not known how the genetic information was put into effect or transformed into chemical activity. As to the function of the genes, it was thought that they participated in a sort of procreative act when the new cell came into being, producing new substances necessary for the life of the cell, but subsequently lying dormant until the next cell division. It was presumed that the structure and formation of the chemical apparatus determined in this way defined all the regulatory mechanisms necessary for the cell's ability to adapt to changes in the environment and to respond in an adequate manner to stimuli of different types.
To begin with, the group of French workers were able to demonstrate how the structural information of the genes was used chemically. During a process resembling gene multiplication an exact copy of the genetic code is produced, termed a messenger. The latter is then incorporated into the chemical «workshop» of the cell and wound like magnetic tape onto a spool. For each word arriving on the spool, a constructional unit is attracted, which carries a complement to this word and attaches itself there just like a piece of jigsaw puzzle. The building blocks of a protein are selected in this way one by one, aligned, and joined together to form a protein with the appropriate structure.
The messenger substance is, however, short-lived. The tape lasts only for a few recordings. The enzymes are also used up in a similar way. For the cell to maintain its activity, it is thus necessary to have an uninterrupted production of the messenger material, that is to say continuous activity of the corresponding gene.
However, cells can adapt themselves to different external conditions. Thus there must exist some mechanisms controlling the activity of the genes. The research into the nature of these mechanisms is a remarkable achievement which has opened the way for the possible explanation of a series of hitherto mysterious biological phenomena. The discovery of a previously unknown class, the operator genes, which control the structural genes, marks a major breakthrough.
There are two types of operator genes. One type releases chemical signals, which are perceived by a second, receptor, type. The latter controls in its turn one or more structural genes. As long as the signals are being received the receptor remains blocked and the structural genes are inactive. Certain substances coming from outside or formed within the cell can, however, influence the chemical signals in a specific manner, changing their character so that they can no longer influence the receptor. The latter is unblocked and activates the structural genes; messenger material is produced and the synthesis of enzymes or another protein commences.
Control of gene activity is thus of a negative nature; the structural genes are only active if the repressor signals do not arrive. One can speak here of chemical control circuits similar in many ways to electrical circuits, for example in a television set. In the same way, they can be interconnected or arranged in a series to form complicated systems.
With the aid of such control circuits, the free living monocellular organism can produce enzymes when required, or interrupt chemical reactions if they are likely to cause damage; an excitatory stimulus can provoke movement, flight or attack, depending on the nature of the excitation. With such mechanisms it is possible to direct the development of cells into more complicated structures. It is particularly interesting to note that the activity of viruses is controlled, in principle, in the same manner.
Bacteriophages contain a genetic control circuit complete with emitter, receptor, and structural genes. While chemical signals are being sent and received, the virus remains inactive. When incorporated into a cell, it behaves like a normal component of the cell, and can confer on it new properties which may improve its chances of survival in the struggle for existence. However, if the signals are interrupted, the virus is activated, starts to grow rapidly and soon kills the host cell. There is considerable evidence for the view that certain types of tumor virus are incorporated into a normal cell in the same way, thus transforming it into a tumour cell.
We are easily inclined to hold an exaggerated opinion of ourselves in this era of advanced technology. Thus, we are justified in having a great admiration for the achievements in electronics, where, for example, the attempts at miniaturization to reduce component size, to lower the weight, and reduce the volume of apparatus have enabled a rapid development of space science. However, we should bear in mind that, millions of years ago, nature perfected systems far surpassing all that the inventive genius of man has been able to conceive hitherto. A single living cell, measuring several thousandths of a millimetre, contains hundreds of thousands of chemical control circuits, exactly harmonized and functioning infallibly. It is hardly possible to improve on miniaturization further; we are dealing here with a level where the components are single molecules. The group of French workers has opened up a field of research which in the truest sense of the word can be described as molecular biology.
Lwoff represents microbiology, Monod biochemistry, and Jacob cellular genetics. Their decisive discovery would not have been possible without competence and technical knowledge in all these fields, nor without intimate cooperation between the three researchers. But the mystery of life is not resolved simply with knowledge and technical skill. One must also have a gift for observation, a logical intellect, a faculty for the synthesis of ideas, a degree of imagination, and scientific intuition, qualities with which the three workers are liberally endowed.
Research in this field has not yet yielded results that can be used in practice. However, the discoveries have given a strong impetus to research in all domains of biology with far-reaching effects spreading out like ripples in the water. Now that we know the nature of such mechanisms, we have the possibility of learning to master them, with all the consequences which that will surely entail for practical medicine.
François Jacob, André Lwoff, Jacques Monod. Thanks to your technically unimpeachable experiments and your ingenious and logical deductions, you have gained a more intimate familiarity with the nature of vital functions than anyone before you has done. Action, coordination, adaptation, variation - these are the most striking manifestations of living matter. By placing more emphasis on dynamic activity and mechanisms than on structure, you have laid the foundations for the science of molecular biology in the true sense of the term. In the name of the Caroline Institute, I ask you to accept our admiration and our most sincere congratulations. Finally, I invite you to come down from the platform to receive the prize from His Majesty the King.
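The pairing rule and the three-letter code words that Gard describes in the speech can be sketched in a few lines of Python. This is only an illustration and not part of the speech: the function names are made up for this example, and the partner strand is written in the same left-to-right order as the original, ignoring the fact that the two strands of real DNA run in opposite directions.

```python
from itertools import product

# Pairing rule from the speech: "a" pairs with "t", "g" pairs with "c".
PAIRS = {"a": "t", "t": "a", "g": "c", "c": "g"}


def partner_strand(strand):
    """Return the letters that must sit opposite each letter of `strand`."""
    return "".join(PAIRS[letter] for letter in strand)


def triplet_words():
    """List every three-letter word that a four-letter alphabet can form."""
    return ["".join(letters) for letters in product("acgt", repeat=3)]


if __name__ == "__main__":
    print(partner_strand("gattaca"))
    print(len(triplet_words()))
```

Four letters taken three at a time give 4 × 4 × 4 = 64 possible words, so the "more than 30" in the speech is a cautious lower bound.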
A Canadian in Paris
Tuesday, February 12, 2008
Goodbye Timmy's
Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm
A newly published paper in PLoS Biology (Li et al. 2008) finds that many transcription factors are non-specifically bound to DNA and they may not be involved in regulating gene expression at most binding sites. For an explanation of why this should not be a surprise see Repression of the lac Operon.
Here's the Author Summary of the paper.
One of the largest classes of regulatory proteins in animals, sequence-specific DNA binding transcription factors determine in which cells genes will be expressed and so control the development of an animal from a single cell to a morphologically complex adult. Understanding how this process is coordinated depends on knowing the number and types of genes that each transcription factor binds and regulates. Using immunoprecipitation of in vivo crosslinked chromatin coupled with DNA microarray hybridization (ChIP/chip), we have determined the genomic binding sites in early embryos of six transcription factors that play a crucial role in early development of the fruit fly Drosophila melanogaster. We find that these proteins bind to several thousand genomic regions that lie close to approximately half the protein coding genes. Although this is a much larger number of genes than these factors are generally thought to regulate, we go on to show that whereas the more highly bound genes generally look to be functional targets, many of the genes bound at lower levels do not appear to be regulated by these factors. Our conclusions differ from those of other groups who have not distinguished between different levels of DNA binding in vivo using similar assays and who have generally assumed that all detected binding is functional.
Li et al. (2008) Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm. [PLoS Biology]
Goodbye Toronto
Happy Birthday Charles Darwin
Charles Robert Darwin was born on this day in 1809. Darwin was the greatest scientist who ever lived.
In honor of his birthday, and given that this is a year of politics in America, I thought it would be fun to post something about Darwin's interactions with politicians. The historical account is from Janet Browne's excellent biography (Browne 2002).
William Gladstone (photo below) was an orthodox Christian. He was not a fan of evolution. In March 1877 Gladstone was leader of the Liberal party and a former Prime Minister of the most powerful country in the world. He was spending the weekend with John Lubbock, a well-known liberal, and a few other friends, including Thomas Huxley.
They decided to walk over to Darwin's house in Downe. This was 18 years after the publication of the Origin and Darwin was a famous guy. The guests were cordially received by Darwin and his wife Emma. Darwin and Emma were life-long liberals and they were honored by Gladstone's visit. A few days later, Darwin wrote a note to his friend saying,
Our quiet, however, was broken a couple of days ago by Gladstone calling here.—I never saw him before & was much pleased with him: I expected a stern, overwhelming sort of man, but found him as soft & smooth as butter, & very pleasant. He asked me whether I thought that the United States would hereafter play a much greater part in the history of the world than Europe. I said that I thought it would, but why he asked me, I cannot conceive & I said that he ought to be able to form a far better opinion,—but what that was he did not at all let out.
A few years later Gladstone sent Darwin one of his essays on Homer. Darwin gratefully acknowledged the gesture.
In 1881, when Gladstone was Prime Minister again, Darwin and some of his friends petitioned Gladstone to award a pension to Alfred Russel Wallace, who was in dire financial straits at the time. Gladstone granted the request. Two months later Gladstone offered Darwin a position as trustee of the British Museum but Darwin declined. (Remember, Gladstone did not agree with Darwin about evolution, or religion.)
When Darwin died, Gladstone was instrumental in arranging for him to be buried in Westminster Abbey. The funeral was held on April 26, 1882. William Gladstone was too busy to attend. He went to a dinner at Windsor.
Browne, J. (2002) Charles Darwin: The Power of Place (Vol. II). Alfred A. Knopf, New York (USA)
Repression of the lac Operon
There are many lessons to be learned from understanding the regulation of transcription of a well-studied system like the E. coli lac operon. Some of those lessons have consequences when we think about the problems of having large eukaryotic genomes. Read the description below and the implications that follow.
From Horton et al. (2006) p. 666
lac repressor binds simultaneously to two sites near the promoter of the lac operon. Repressor-binding sites are called operators. One operator (O1) is adjacent to the promoter, and the other (O2) is within the coding region of lacZ. When bound to both operators, the repressor causes the DNA to form a stable loop that can be seen in electron micrographs of the complex formed between lac repressor and DNA (bottom figure). The interaction of lac repressor with the operator sequences may block transcription by preventing the binding of RNA polymerase to the lac promoter. However, it is now known that, in some cases, both lac repressor and RNA polymerase can bind to the promoter at the same time. Thus, the repressor may also block transcription initiation by preventing formation of the open complex and promoter clearance. A schematic diagram of lac repressor bound to DNA in the presence of RNA polymerase is shown in the figure on the right. [See Monday's Molecule #61 for another view.] The diagram illustrates the relationship between the operators and the promoter and the DNA loop that forms when the repressor binds to DNA.
The repressor locates an operator by binding nonspecifically to DNA and searching in one dimension. (Recall from Section 21.3C that RNA polymerase also uses this kind of searching mechanism.) The equilibrium association constant for the binding of lac repressor to O1 in vitro is very high. As a result, the repressor blocks transcription very effectively. (lac repressor binds to the O2 site with lower affinity.) A bacterial cell contains only about 10 molecules of lac repressor, but the repressor searches for and finds an operator so rapidly that when a repressor dissociates spontaneously from the operator, another occupies the site within a very short time. However, during this brief interval, one transcript of the operon can be made since RNA polymerase is poised at the promoter. This low level of transcription, called escape synthesis, ensures that small amounts of lactose permease and β-galactosidase are present in the cell.
In the absence of lactose, lac repressor blocks expression of the lac operon, but when β-galactosides are available as potential carbon sources, the genes are transcribed. Several β-galactosides can act as inducers. If lactose is the available carbon source, the inducer is allolactose, which is produced from lactose by the action of β-galactosidase (Figure 21.18). Allolactose binds tightly to lac repressor and causes a conformational change that reduces the affinity of the repressor for the operators. [see Regulation of Transcription] In the presence of the inducer, lac repressor dissociates from the DNA, allowing RNA polymerase to initiate transcription. (Note that because of escape synthesis, lactose can be taken up and converted to allolactose even when the genes are repressed.)
However, the repressor will eventually fall off (dissociation rate constant k₋₁ = 6 × 10⁻⁴ s⁻¹) and, as described above, the operon will be transcribed once (escape synthesis). A new repressor molecule finds the operator sequences very quickly because lac repressor binds non-specifically to DNA (K_B = 4 × 10⁴) and slides along the DNA searching for the operator in a process called one-dimensional diffusion (association rate constant k₁ = 10¹⁰ M⁻¹ s⁻¹). Even though the lac repressor only remains bound non-specifically for a few seconds, it is able to search about 2000 bp looking for a specific binding site.
Given the huge difference between the specific and non-specific binding constants, the cell only needs about ten molecules of lac repressor to ensure that the operator sequences are bound almost all of the time. At any given time nine of these molecules will be bound to random pieces of DNA in the genome and the other one will be bound to the lac operon.
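Here is a rough sketch, in Python, of what the rate constants quoted above imply. It assumes that specific binding is a simple one-step equilibrium, so that the association constant is just k₁ divided by k₋₁, and that the non-specific binding constant is expressed in the same units (M⁻¹). The function and variable names are mine, for illustration only.

```python
import math

# Constants quoted in the text above.
K_ON = 1e10           # association rate constant k1, in M^-1 s^-1
K_OFF = 6e-4          # dissociation rate constant k-1, in s^-1
K_NONSPECIFIC = 4e4   # non-specific binding constant KB (assumed to be in M^-1)


def mean_bound_time(k_off):
    """Average time (s) a repressor stays on the operator before falling off."""
    return 1.0 / k_off


def half_life(k_off):
    """Time (s) after which half of the bound complexes have dissociated."""
    return math.log(2) / k_off


def association_constant(k_on, k_off):
    """Equilibrium association constant (M^-1), assuming one-step binding."""
    return k_on / k_off


if __name__ == "__main__":
    k_specific = association_constant(K_ON, K_OFF)
    print(f"mean bound time: {mean_bound_time(K_OFF) / 60:.0f} min")
    print(f"half-life:       {half_life(K_OFF) / 60:.0f} min")
    print(f"specific K:      {k_specific:.1e} M^-1")
    print(f"specific / non-specific: {k_specific / K_NONSPECIFIC:.1e}")
```

Under these assumptions a bound repressor stays on the operator for about 28 minutes on average, and the specific binding constant works out to roughly 1.7 × 10¹³ M⁻¹, some 400 million times larger than the non-specific one.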
Similar repressors and activators work in eukaryotic cells to regulate transcription. But in eukaryotic cells we have a much bigger problem. First, there are very few regulatory proteins that have as strong a specific binding constant as lac repressor. Second, there is much more DNA in a eukaryotic cell. The consequences of having a large genome are: (a) it takes these DNA binding proteins much longer to find their specific binding site, and (b) at any one time, many more of the regulatory proteins are soaked up in non-specific binding to DNA. In eukaryotic cells with an abundance of junk DNA a typical regulatory protein has to be present at about 20,000 copies per cell in order to have a decent chance of binding to its specific regulatory site for a significant length of time. (Recall that only ten molecules of lac repressor are needed in E. coli.)
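To see why genome size matters, here is a minimal equilibrium sketch in Python. It is a toy model, not a calculation from the textbook: it treats every base pair as a competing non-specific site, assumes those sites are in large excess over the protein, and ignores chromatin altogether. The cell volumes, the genome sizes, and the weaker specific binding constant used for the eukaryotic case are hypothetical placeholders, so the numbers only show the direction of the effect.

```python
AVOGADRO = 6.022e23  # molecules per mole


def molar(count, volume_litres):
    """Convert a number of molecules (or sites) in a volume to mol/L."""
    return count / (AVOGADRO * volume_litres)


def site_occupancy(copies, genome_bp, k_specific, k_nonspecific, volume_litres):
    """Fraction of time one specific site is occupied at equilibrium.

    Non-specific DNA soaks up most of the protein; only the free remainder
    is available to bind the specific site.
    """
    total_protein = molar(copies, volume_litres)
    nonspecific_sites = molar(genome_bp, volume_litres)
    free_protein = total_protein / (1.0 + k_nonspecific * nonspecific_sites)
    return k_specific * free_protein / (1.0 + k_specific * free_protein)


if __name__ == "__main__":
    # Bacterium: 10 copies, an assumed round 4 million bp genome, a specific
    # constant of about 1.7e13 M^-1 (k1 / k-1 from the quoted rate constants),
    # and an assumed cell volume of 1e-15 L.
    print(site_occupancy(10, 4e6, 1.7e13, 4e4, 1e-15))

    # Hypothetical eukaryotic factor: assumed 3 billion bp genome, a much
    # weaker specific constant (1e9 M^-1) and a 1e-13 L nucleus.
    for copies in (10, 1000, 20000):
        print(copies, site_occupancy(copies, 3e9, 1e9, 4e4, 1e-13))
```

With these inputs the bacterial operator comes out occupied more than 99% of the time, in line with the text. Raising the genome size or lowering the specific binding constant pushes occupancy down unless the copy number rises.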
Given the properties of DNA binding that we have discovered and characterized in bacteria and bacteriophage, we can calculate that escape synthesis in eukaryotic cells is likely to be much more of a problem than in bacterial cells. Furthermore, accidental transcription of random bits of DNA is almost certainly going to be common in a cell with a large bloated genome. This is because RNA polymerase also binds non-specifically to DNA and also because the larger the genome, the more likely you are to encounter promoter and regulatory sequences that just by chance happen to be close matches to real functional sequences. This is a very important concept and one that is not widely appreciated. Based on our knowledge of basic biochemistry we expect that there will be random, infrequent transcription of a large percentage of the genome. These transcripts are merely a consequence of the properties of DNA binding proteins and they have no biological significance.
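The chance-match argument can be made concrete with a back-of-the-envelope calculation. The Python sketch below assumes a random genome with equal frequencies of the four bases and counts both strands (which double-counts palindromic motifs). Real genomes are not random, so treat the output as an order-of-magnitude guide at best. The function names, the genome sizes, and the example motif length of 12 are hypothetical.

```python
from math import comb


def match_probability(motif_len, max_mismatches=0):
    """Probability that a random site matches a motif with at most
    `max_mismatches` differing positions (each base equally likely)."""
    return sum(
        comb(motif_len, i) * 0.75**i * 0.25 ** (motif_len - i)
        for i in range(max_mismatches + 1)
    )


def expected_chance_matches(genome_bp, motif_len, max_mismatches=0):
    """Expected number of chance matches on both strands of a random genome."""
    positions = max(genome_bp - motif_len + 1, 0)
    return 2 * positions * match_probability(motif_len, max_mismatches)


if __name__ == "__main__":
    for genome_bp in (4_000_000, 3_000_000_000):
        exact = expected_chance_matches(genome_bp, 12)
        close = expected_chance_matches(genome_bp, 12, max_mismatches=2)
        print(f"{genome_bp:>13,} bp: {exact:10.1f} exact, "
              f"{close:12.1f} within 2 mismatches")
```

Because the expected count grows in direct proportion to genome length, a genome a thousand times larger has roughly a thousand times as many accidental near-matches.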
Some of these problems in eukaryotes are mitigated by a separate level of regulation at the level of chromatin structure. Large regions of the chromosome can be masked from DNA binding proteins by formation of a tight heterochromatic complex of nucleosomes and DNA. Less compact complexes are formed in non-active regions of the genome where the DNA is less accessible but not invisible. When genes in a region are transcribed, the chromatin opens out into an open complex where the DNA is easily accessible to regulatory proteins. This solves some of the problems discussed above but it is only a partial solution. We know for a fact that the concentrations of regulatorty proteins are high (20,000 copies) and a growing amount of evidence points to frequent accidental transcription.
©Laurence A. Moran and Pearson Prentice Hall
From Horton et al. (2006) p. 666
lac repressor binds simultaneously to two sites near the promoter of the lac operon. Repressor-binding sites are called operators. One operator (O1) is adjacent to the promoter, and the other (O2) is within the coding region of lacZ. When bound to both operators, the repressor causes the DNA to form a stable loop that can be seen in electron micrographs of the complex formed between lac repressor and DNA (bottom figure). The interaction of lac repressor with the operator sequences may block transcription by preventing the binding of RNA polymerase to the lac promoter. However, it is now known that, in some cases, both lac repressor and RNA polymerase can bind to the promoter at the same time. Thus, the repressor may also block transcription initiation by preventing formation of the open complex and promoter clearance. A schematic diagram of lac repressor bound to DNA in the presence of RNA polymerase is shown in the figure on the right. [See Monday's Molecule #61 for another view.] The diagram illustrates the relationship between the operators and the promoter and the DNA loop that forms when the repressor binds to DNA.
The repressor locates an operator by binding nonspecifically to DNA and searching in one dimension. (Recall from Section 21.3C that RNA polymerase also uses this kind of searching mechanism.) The equilibrium association constant for the binding of lac repressor to O1 in vitro is very high. As a result, the repressor blocks transcription very effectively. (lac repressor binds to the O2 site with lower affinity.) A bacterial cell contains only about 10 molecules of lac repressor, but the repressor searches for and finds an operator so rapidly that when a repressor dissociates spontaneously from the operator, another occupies the site within a very short time. However, during this brief interval, one transcript of the operon can be made since RNA polymerase is poised at the promoter. This low level of transcription, called escape synthesis, ensures that small amounts of lactose permease and β-galactosidase are present in the cell.
In the absence of lactose, lac repressor blocks expression of the lac operon, but when β-galactosides are available as potential carbon sources, the genes are transcribed. Several β-galactosides can act as inducers. If lactose is the available carbon source, the inducer is allolactose, which is produced from lactose by the action of β-galactosidase (Figure 21.18). Allolactose binds tightly to lac repressor and causes a conformational change that reduces the affinity of the repressor for the operators. [see Regulation of Transcription] In the presence of the inducer, lac repressor dissociates from the DNA, allowing RNA polymerase to initiate transcription. (Note that because of escape synthesis, lactose can be taken up and converted to allolactose even when the genes are repressed.)
Electron micrographs of DNA loops. These loops were formed by mixing lac repressor with a fragment of DNA bearing two synthetic lac repressor–binding sites. One binding site is located at one end of the DNA fragment, and the other is 535 bp away. DNA loops 535 bp in length form when the tetrameric repressor binds simultaneously to the two sites.
The strength of binding between a protein and a ligand is measured by an equilibrium binding constant (KB). In the case of lac repressor binding to its specific strong binding site (O1), KB = 10¹³ M⁻¹. This is very high; in fact, it is one of the tightest DNA-binding interactions known in biology. What this means is that lac repressor will sit on the operon and repress transcription for at least 20 minutes under normal conditions.
However, the repressor will eventually fall off (dissociation rate constant k₋₁ = 6 × 10⁻⁴ s⁻¹) and, as described above, the operon will be transcribed once (escape synthesis). A new repressor molecule finds the operator sequences very quickly because lac repressor binds non-specifically to DNA (KB = 4 × 10⁴) and slides along the DNA searching for the operator in a process called one-dimensional diffusion (association rate constant k₁ = 10¹⁰ M⁻¹ s⁻¹). Even though the lac repressor only remains bound non-specifically for a few seconds, it is able to search about 2000 bp looking for a specific binding site.
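As a rough check on how these constants fit together, here is a minimal Python sketch. It assumes simple first-order dissociation of the repressor from the operator, and the function names are illustrative only. Under that assumption the mean time bound is 1/k₋₁, the half-life is ln 2/k₋₁, and the ratio k₁/k₋₁ should land near the quoted binding constant.

```python
import math

# Rate constants quoted in the text for lac repressor binding to O1.
K_OFF = 6e-4   # dissociation rate constant k-1, in s^-1
K_ON = 1e10    # association rate constant k1, in M^-1 s^-1

def mean_residence_time(k_off):
    """Mean lifetime (s) of the bound complex, assuming first-order dissociation."""
    return 1.0 / k_off

def half_life(k_off):
    """Time (s) for half of the bound complexes to dissociate."""
    return math.log(2) / k_off

def equilibrium_constant(k_on, k_off):
    """Equilibrium binding constant (M^-1) as the ratio of the two rate constants."""
    return k_on / k_off

residence_min = mean_residence_time(K_OFF) / 60
half_life_min = half_life(K_OFF) / 60
k_b_from_rates = equilibrium_constant(K_ON, K_OFF)

print(f"mean residence time: {residence_min:.1f} min")
print(f"half-life:           {half_life_min:.1f} min")
print(f"k1 / k-1:            {k_b_from_rates:.2e} M^-1")
```

Both time estimates come out in the tens of minutes, the same range as the 20 minutes mentioned above, and the ratio of the rate constants is of the order of 10¹³ M⁻¹.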
Given the huge difference between the specific and non-specific binding constants, the cell only needs about ten molecules of lac repressor to ensure that the operator sequences are bound almost all of the time. At any given time nine of these molecules will be bound to random pieces of DNA in the genome and the other one will be bound to the lac operon.
Similar repressors and activators work in eukaryotic cells to regulate transcription. But in eukaryotic cells we have a much bigger problem. First, there are very few regulatory proteins that have as strong a specific binding constant as lac repressor. Second, there is much more DNA in a eukaryotic cell. The consequences of having a large genome are: (a) it takes these DNA binding proteins much longer to find their specific binding site, and (b) at any one time, many more of the regulatory proteins are soaked up in non-specific binding to DNA. In eukaryotic cells with an abundance of junk DNA a typical regulatory protein has to be present at about 20,000 copies per cell in order to have a decent chance of binding to its specific regulatory site for a significant length of time. (Recall that only ten molecules of lac repressor are needed in E. coli.)
Given the properties of DNA binding that we have discovered and characterized in bacteria and bacteriophage, we can calculate that escape synthesis in eukaryotic cells is likely to be much more of a problem than in bacterial cells. Furthermore, accidental transcription of random bits of DNA is almost certainly going to be common in a cell with a large bloated genome. This is because RNA polymerase also binds non-specifically to DNA and also because the larger the genome, the more likely you are to encounter promoter and regulatory sequences that just by chance happen to be close matches to real functional sequences. This is a very important concept and one that is not widely appreciated. Based on our knowledge of basic biochemistry we expect that there will be random, infrequent transcription of a large percentage of the genome. These transcripts are merely a consequence of the properties of DNA binding proteins and they have no biological significance.
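The point about chance matches scaling with genome size can be illustrated with a back-of-the-envelope calculation. The Python sketch below is only an idealization: it assumes a random genome with equal base frequencies, counts exact matches only, and uses approximate genome sizes and a hypothetical 12 bp site length, none of which come from the text above.

```python
# Expected number of chance exact matches to one specific binding site,
# assuming a random genome with equal base frequencies (an idealization).

def expected_chance_matches(genome_bp, site_bp, strands=2):
    """Expected count of exact matches to a given site_bp-long sequence."""
    positions = genome_bp - site_bp + 1        # possible start positions per strand
    return strands * positions / 4 ** site_bp  # each position matches with p = 4^-site_bp

# Approximate genome sizes (assumptions for illustration).
E_COLI_BP = 4.6e6   # E. coli, roughly 4.6 Mb
HUMAN_BP = 3.2e9    # human haploid genome, roughly 3.2 Gb
SITE_BP = 12        # hypothetical binding-site length

ecoli_hits = expected_chance_matches(E_COLI_BP, SITE_BP)
human_hits = expected_chance_matches(HUMAN_BP, SITE_BP)

print(f"E. coli: {ecoli_hits:.2f} expected chance matches")
print(f"Human:   {human_hits:.0f} expected chance matches")
```

With these assumed numbers a bacterial genome is expected to contain less than one chance copy of such a site, while a mammalian-sized genome is expected to contain hundreds. Allowing near matches instead of exact ones would raise both counts.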
Some of these problems in eukaryotes are mitigated by a separate level of regulation at the level of chromatin structure. Large regions of the chromosome can be masked from DNA binding proteins by formation of a tight heterochromatic complex of nucleosomes and DNA. Less compact complexes are formed in non-active regions of the genome where the DNA is less accessible but not invisible. When genes in a region are transcribed, the chromatin opens out into an open complex where the DNA is easily accessible to regulatory proteins. This solves some of the problems discussed above but it is only a partial solution. We know for a fact that the concentrations of regulatory proteins are high (20,000 copies) and a growing amount of evidence points to frequent accidental transcription.
©Laurence A. Moran and Pearson Prentice Hall
Horton, H.R., Moran, L.A., Scrimgeour, K.G., Perry, M.D. and Rawn, J.D. (2006) Principles of Biochemistry. Pearson/Prentice Hall, Upper Saddle River N.J. (USA)
The Lac Operon
The lac operon in E. coli consists of three genes[1] (lacZ, lacY and lacA) transcribed from a single promoter. The lacZ gene encodes the enzyme β-galactosidase, an enzyme that cleaves β-galactosides. Lactose is a typical β-galactoside and the enzyme cleaves the disaccharide, converting it to separate molecules of glucose and galactose. These monosaccharides can enter into the metabolic pool of the cell where they can serve as the sole source of carbon.
Thus, when the lac operon is active and β-galactosidase is present, E. coli can grow on lactose as its only source of carbon. Outside of the laboratory, E. coli rarely encountered lactose (until recently) but there are many plant β-galactosides that are substrates for the enzyme.
LacY encodes a famous transporter called lactose permease. It is responsible for importing β-galactosides. The lacA gene encodes a transacetylase that is responsible for detoxifying the cell when it takes up poisonous β-galactosides.
Transcription begins at the Plac promoter and ends at a terminator at the 3′ end of the operon. Each of the three reading frames is translated separately from the polycistronic mRNA.
Upstream of the lac operon is the lacI gene. It encodes the lac repressor, one of the proteins that controls expression of the lac operon. The lacI gene is transcribed from its own promoter and it has its own terminator. (It is not necessary for the lacI gene to be linked to the operon.)
Expression of β-galactosidase, lac permease, and the transacetylase is regulated at the level of transcription. RNA polymerase binds to the lac promoter but this is a weak σ70 promoter.[2] The promoter sequence is a poor match to the consensus sequence for these types of promoters so the operon is transcribed infrequently in the absence of additional activators. Transcription of the operon is activated by cAMP regulatory (or receptor) protein (CRP).[3]
In the absence of any β-galactoside, the operon is not transcribed and no enzyme is synthesized. Transcription is prevented by lac repressor, which binds to two operator sequences called O1 and O2. When β-galactosides are present, repression is relieved and the operon is transcribed at a low level in order to take advantage of the carbon source. When there is no other carbon source available, the operon is activated by CRP and the rate of transcription, and with it enzyme production, increases considerably.
1. This is one of the exceptions to the standard definition of a gene [What Is a Gene?]. In this case we are using the word "gene" to mean the coding region for a particular protein.
2. There are many different promoters in the E. coli genome. They are recognized by various RNA polymerase complexes containing different bound activators. One set of common activators is called σ factors: σ70 is the most common σ factor. Most genes have a σ70 promoter.
3. CRP is also known as catabolite activator protein (CAP).
Monday, February 11, 2008
Monday's Molecule #62
Today's molecule is a cartoon depicting the action of several molecules. Your task is to identify all the molecules in the diagram and explain what's going on. Even if you're not interested in a free lunch, I'd appreciate hearing from you. I'd like to know how many of you understand the diagram. In fact, I'll put a poll in the sidebar to see how many recognize the process that's depicted here.
There's an indirect connection between this molecule and Wednesday's Nobel Laureate(s). Your task is to figure out the significance of today's diagram and identify the Nobel Laureate(s) who is associated with discovering the underlying process. (Be sure to check previous Laureates.)
The reward goes to the person who correctly identifies the molecule and the Nobel Laureate(s). Previous winners are ineligible for one month from the time they first collected the prize. There are three ineligible candidates for this week's reward. The prize is a free lunch at the Faculty Club.
Send your guess to Sandwalk (sandwalk(at)bioinfo.med.utoronto.ca) and I'll pick the first email message that correctly identifies the molecules, the process, and the Nobel Laureate(s). Note that I'm not going to repeat Nobel Laureates so you might want to check the list of previous Sandwalk postings.
Correct responses will be posted tomorrow along with the time that the message was received on my server. I may select multiple winners if several people get it right.
Comments will be blocked for 24 hours.
Sunday, February 10, 2008
Stu Kauffman in Toronto
I went to hear Stu Kauffman on Friday night [see Reinventing the Sacred].
Before the talk we had a little chat about blogging and some other topics. He wondered what the bloggers were saying about him and I told him that many don't understand what he's trying to say. I explained that I fell into that same category. I can't figure out what it is that he's trying to promote. He promised to try and explain in his talk.
It didn't work. I'm not much further ahead than I was before I heard him talk. Here's a brief summary of some things he said. I'm sorry if I can't put it all together into one big picture but I just can't.
The New Atheists: Kauffman thinks that Dawkins and his "New Atheist" friends are preaching to the converted. According to Kauffman, they will never convince the believers. Kauffman describes himself as a secular humanist and a non-believer. He thinks we should try to reach out to the religious community by adopting spiritual language. Hence the title of his talk. I don't really know what he means by this. He gave one example of having a reverence for some trees growing on a hill top near his house but I'm not sure if this is relevant. (See photograph, is that the hill top?)
I don't agree with his position on the so-called New Atheists and I don't agree with his proposal that it's the atheists who need to move towards the theists by adopting the sacred.
Reductionism: Kauffman is very much opposed to reductionism. He spent some time describing how the laws of physics just don't work when you try to predict the structure of complex things. This does not mean that complex things don't obey the laws of physics and chemistry; it means those laws aren't sufficient. This is because of emergent properties.
The discussion about reductionism and emergent properties is interesting but Kauffman makes it too complicated, for me, by going off on all kinds of tangents. In talking about it with him afterwards, he seems to be thinking that life is somehow special. It's different than the physical world. He takes pains to point out that he's not talking about vitalism but it sure sounds like that to me.
The other interesting thing about his anti-reductionism is that it doesn't apply in the same sense that Lewontin means when he talks about gene-centric biology. Before the lecture we were discussing the reason why human siblings don't mate and Kauffman was quite eager to offer an evolutionary psychology explanation. He suggested there was selection for an anti-incest gene in our ancestors to prevent inbreeding. That's the worst kind of reductionism but it's not the sort of reductionism that Kauffman disputes.
Determinism: Kauffman doesn't like determinism. He pointed out that quantum mechanics has ruled out the Laplace version of determinism. I don't think this is particularly controversial but I do think there are versions of determinism that don't require strict predictability. I kept waiting for the other shoe to drop. I don't think Kauffman was trying to make a case for free will and I don't think he was using his anti-determinism to argue against materialism, but I'm not sure.
Somehow these topics, and several others, were supposed to weave together to form a new way of looking at science. And a new way of reaching out to theists. That's the part I didn't get. A lot of what he was saying was true, but hardly profound. What was supposed to be profound didn't seem to be true.
Stu Kauffman took down the URL for Sandwalk and he promised to read my comments on his lectures. I hope he will respond in the comments. He seems like a pretty cool guy even if he's a bit baffling.
The dominant impression I have from talking to members of the audience—there were 65 people at the talk—is that people think he's saying something important but they just can't put their finger on what it is. At least I'm not the only one.
Where does disbelief in Darwin lead?
You probably think the answer to the question is obvious. The rejection of science leads to irrational behavior, right?
Of course it's right. DaveScot sets out to prove it over on Uncommon Descent with a posting that has the same title as this one [Where does disbelief in Darwin lead?]. As you read it, remember that the person who is writing the article is a disbeliever in evolution. Let's see where that kind of thinking leads ....
Be that as it may I’m a results oriented guy. Instead of presuming that “poorer” science education leads to poorer scientific output I instead look at what America actually produces in the way of science and engineering. Without question America’s output in science and engineering leads the world. Not just a little but a lot. We don’t steal nuclear technology secrets from China, they steal ours. We don’t use European GPS satellites for navigation, they use ours. The list can go on and on. We put a man on the moon 40 years ago while to this day no one else has. America has almost 3 times the number of Nobel prize winners as the next closest nation. That doesn’t support the notion that disbelief in Darwin is causing any problems. In fact it supports just the opposite. Disbelief in evolution makes a country into a superpower - militarily, economically, and yes even scientifically.
Education in America is working just fine, thank you, judging by the fruits of American science and engineering. Disbelief in Darwinian evolution, if anything, leads to greater technological achievements not lesser. If it isn’t broken, don’t try to fix it.
Well, there you have it. If only those successful scientists, engineers, and Nobel Laureates[1] would stop believing in evolution there's no limit to what America could achieve. Just look at how far America has come when it's only the ignorant who disbelieve in evolution!
You know, you simply can't make this stuff up.
1. America is pretty much in the middle of the pack in terms of Nobel Laureates per capita [Nobel Prizes by Country]. It takes a bit of intelligence and simple math to recognize that point.
What Freedom of Speech Really Means
Read the amazing story on Friendly Atheist [Atheist Billboard Taken Down].
The Freedom from Religion Foundation contracted with Kegerreis Outdoor Advertising LLC to put up the following billboard in Chambersburg, Pennsylvania (USA).
That billboard has now been taken down and replaced with,
Read what the company has to say about their decision. It should not be necessary to point out what freedom of speech means and it is not proper for an advertising company to publicly state their moral values. Do all employees of Kegerreis Outdoor Advertising LLC agree with the statement? If not, are they going to make their views known at company headquarters? Would you?
A Case of Plagiarism
The blogosphere is all atwitter about the publication of a paper titled "Mitochondria, the missing link between body and soul: Proteomic prospective evidence". This is the train wreck of a paper that PZ Myers blogged about a few days ago [What Happened to the "Peers" on this Paper?].
Everyone needs to know that the contents of this paper were not only stupid but also plagiarized. The authors couldn't even come up with their own words to explain their silly ideas. For the latest additions to a long list of stolen paragraphs see Commentary: Neither buried nor treasure.
The guilty journal is Proteomics. The editors are not blameless.
The Streisand Effect in Action
I mentioned the Streisand Effect a few days ago. Here's a perfect example of how it works.
ThePolitic.com is one of those Canadian blogs written by someone (Matthew) who embarrasses my country. Matthew writes,
It’s no secret that many of us liberty and/or family-minded folks are great fans of The National Post which officially only competes with the Globe and Mail but realistically also occupies reality that the Toronto Star and Toronto Sun covet. I personally began subscribing to the Post after graduation not because it had a host of right-wing commentators (the Toronto Sun can also claim this), but because the paper took the mission of presenting all view points seriously by often welcoming guest columnists who would attack its editorials, or by presenting series like the one they did two weeks ago on abortion, where a dozen commentators would weigh in on the issue with intelligent, but different viewpoints.
This led me to great sadness today when I went onto their website to read the digital version of the paper. The front cover was just a large cartoon title that said “The Love & Sex Issue” which is tastefully questionable in itself for a national newspaper, but if you look at the picture itself, it also contains the drawing of two nude people behind the “x” which those of us familiar with Japanese pop-culture would classify as hentai. Half of the main section contained articles which were more at home in a Penthouse issue and the Post’s website contains video content that I dare not look at but is clearly part of the above-mentioned theme.
I have since called the Post’s office and dealt with a nice young chap who will be passing along my complaints (the Post is good at responding to these), but in the mean time, I invite everyone else who is disturbed by this extremely poor lack in judgment to write or phone to the Post’s editorial staff:
For more information about the Love & Sex issue go to the National Post website [Love & Sex Issue].
Warning: There may be
Matthew should just be thankful that this topic wasn't covered in one of the left-wing newspapers like the Toronto Star. He probably would have had to leave the country for a few days to avoid the newspaper boxes.
[Hat Tip: Canadian Cynic]
God Only Knows
I found a new Canadian blog called Stony_Curtis: Pop Culture, Guys, Food, Montreal. With a title like that, don't you just have to click on the link?
One of the first things I stumbled across was the video below of the Beach Boys singing "God Only Knows" from the Pet Sounds album. I don't know who Stony Curtis is but I like him already. (P.S. Somehow the song doesn't seem quite as innovative and moving as it was 40 years ago. I wonder why.)
The Meaning of Consensus
What does Stephen Harper mean when he uses the word "consensus" as in,
The Conservative government will not extend Canada's combat mission in Afghanistan beyond February 2009 without a consensus in Parliament, Prime Minister Stephen Harper said Friday.
"I will want to see some degree of consensus among Canadians on how we move forward on that," Harper told reporters Friday in Ottawa.
Canadian Cynic has the answer. [Hint: Harper doesn't mean what you think he means.]
Saturday, February 09, 2008
Junk in Your Genome: Intron Size and Distribution
In the comments to Junk in Your Genome: Protein-Encoding Genes martinc asks,
Larry, if the amount of necessary sequences within introns are as small as you suggest wouldn't this allow us to make a prediction. Couldn't we predict that due to drift there should be very little similarity in intron lengths between different species. If, by any chance, there is similarity then what would your explanation be?
There have been quite a few studies of average intron size in various species. I selected a number for the average size of introns from Hong et al. (2006). The average intron size, according to them, is 3,479 bp in coding regions. This value is a little deceptive since there are a small number of huge introns that make the average quite large. The median value is 1,334 bp, or less than half the average value.
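To see why a handful of huge introns pulls the average so far above the median, here is a small Python sketch using made-up numbers. The list below is hypothetical: the two large values are the monkey and chimpanzee intron lengths quoted later in this post, and the short ones are invented for illustration. It is not the Hong et al. data.

```python
from statistics import mean, median

# Hypothetical intron lengths in bp: mostly short, plus two very large ones.
# 10,253 and 13,257 bp are the large intron sizes quoted in this post;
# the remaining values are invented for illustration.
intron_lengths = [90, 110, 150, 200, 250, 300, 400, 10253, 13257]

avg = mean(intron_lengths)    # pulled upward by the two huge introns
mid = median(intron_lengths)  # the middle value, unaffected by the long tail

print(f"mean = {avg:.0f} bp, median = {mid} bp")
```

In this toy set the mean is more than ten times the median even though seven of the nine introns are 400 bp or shorter, which is the same kind of skew that makes the reported average misleading.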
I suggested that much of the intron sequences were junk. Martinc's question is quite reasonable but in order to get an answer we need to look more closely at the distribution of introns.
The figure shows the distribution of intron sizes in four species: the flowering plant Arabidopsis thaliana; the fruit fly Drosophila melanogaster; human, and mouse. The data is from Hong et al. (2006, Fig.1).
Note that the distribution in Arabidopsis and Drosophila is very tight. Both of these species have relatively compact genomes compared to mammals. The data strongly suggests that the minimum intron size is about 80 bp.
The distributions in the human and mouse genomes are very different. There is a strong peak at 100 bp—this is similar to the peaks in other species. But unlike other species, mammalian introns can be extremely large, giving rise to a long tail of the distribution extending to 10,000 bp or more. The key question is whether this distribution of long introns is noise or an artifact of gene prediction algorithms, or whether it represents a real phenomenon.
Returning to martinc's question: if we look at well-conserved genes in different species, what we find is some variation in intron length but only around a mean of about 100-400 bp. In other words, in genes that have been closely examined, where the protein product is known, the distribution of intron sizes looks a lot more like the distribution in Arabidopsis and Drosophila.
Let's look at the hsp90 genes. These are the genes that encode Hsp90, the protein that SciPhu was blogging about [Hsp90 and Evolution].
I've picked the zebrafish gene and four mammalian genes to illustrate the variation in intron length. (Blue exons are 5′ and 3′ UTRs.) Most of the introns are between 80 and 400 bp in size but there are a few exceptions. In this case the human gene is the exception; it has two huge introns at the 5′ end of the gene.
What we see is a narrow distribution of intron lengths in most cases and a few huge introns. It isn't surprising that the lengths of introns in different species are quite similar.
Let's look at my favorite gene. HSPA8 is the cytoplasmic version of the chaperone HSP70 multigene family.
We see a similar pattern. Most intron lengths are very similar in different species, suggesting selection for introns in the 100-400 bp range. There are exceptions, as we see in the chimpanzee, monkey and dog genes. All three have large introns at either the 5′ or 3′ ends. The large monkey introns are 10,253 bp and 1,007 bp. The large chimpanzee intron is 13,257 bp in length. This is typical. I think it's very likely that the large introns in noncoding exons are artifacts.
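One rough way to pick out the exceptions is to split a gene's introns into those inside the typical range and those outside it. The Python sketch below assumes an 80-400 bp window, based on the range seen in the hsp90 genes. The three large values are the ones quoted above; the short lengths are hypothetical placeholders standing in for the remaining introns, not measurements.

```python
# Range treated as typical for well-characterized genes (bp); an assumption
# based on the 80-400 bp spread described in the post.
TYPICAL_MIN, TYPICAL_MAX = 80, 400

def split_introns(lengths):
    """Return (typical, outliers) for a list of intron lengths in bp."""
    typical = [n for n in lengths if TYPICAL_MIN <= n <= TYPICAL_MAX]
    outliers = [n for n in lengths if not (TYPICAL_MIN <= n <= TYPICAL_MAX)]
    return typical, outliers

# Large values are quoted in the post; the short ones are hypothetical.
monkey = [10253, 1007, 120, 210, 95, 330]
chimp = [13257, 150, 180, 240, 300]

for name, lengths in (("monkey", monkey), ("chimpanzee", chimp)):
    typical, outliers = split_introns(lengths)
    print(f"{name}: {len(typical)} typical introns, outliers = {outliers}")
```

Anything flagged this way is a candidate for a closer look. It might be a real expansion or it might be an annotation artifact.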
So here's the complete answer to the question posed at the top of the page. I think there's selection to maintain intron sizes within a fairly narrow range of 100-400 bp. Because of this, we expect to see similar intron sizes in different species. On occasion we discover a huge intron that is peculiar to one species. This intron could be a transient expansion that hasn't been reduced yet, or it could be an artifact.
Incidentally, while retrieving these sequences from Entrez Gene I noticed that the annotators have eliminated all splice variants for HSP90 and HSPA8 genes with a few exceptions.
The dog sequences all have many splice variants for every gene and some of the variants have been retained in the Entrez Gene entry for dog HSPA8. Look carefully at the two predicted variants in the second and third lines. These alternative splice variants are supposed to produce Hsc70 proteins that are missing several highly conserved regions encoded by exons 7 and 8. Recall that this is the most highly conserved protein in biology.
These cannot be biologically relevant protein variants that are only produced in dogs. The annotators are right to remove similar artifacts from the other genomes and they should remove these as well. Alternative splice variants are mostly artifacts, in my opinion, but that's a fight for another day.
Hong X, Scofield DG, Lynch M (2006) Intron size, abundance, and distribution within untranslated regions of genes. Mol. Biol. Evol. 23:2392-404. [PubMed]
Friday, February 08, 2008
Junk in Your Genome: Protein-Encoding Genes
The typical human gene has eight exons and seven introns (the actual average number of introns is 7.2). These values are based on analysis of 5236 well-characterized human genes with full-length cDNAs (Hong et al. 2006). There are lots of conflicting results in the literature. Most claim there are more introns but the data is based largely on a computational assessment of introns and exons. It includes a number of introns of extraordinary length lying between exons of dubious existence (often non-coding). I'll assume for the time being that there are 7.2 introns per gene, on average, and the average length is 3750 bp (Hong et al. 2006).
Each gene is transcribed from a 5′ promoter (P) and the primary transcript terminates at a polyadenylation site (t).
THEME
Genomes & Junk DNA
Total Junk so far
55%
The exons contain coding regions (blue) that encode the sequence of the protein product. A typical protein has a molecular weight of 70,000 daltons and this corresponds to about 635 amino acid residues. The coding region is 1905 bp but we'll round up to 2 kb. Each gene has a region of the mRNA at the 5′ end called the 5′ untranslated region (UTR). This is required for translation. It averages 200 bp in size, with considerable variation. The 3′ end of the gene has a similar untranslated region that we'll assume to be essential.
Thus, total essential exons comprise 2200 bp on average per gene. Since there are 20,500 protein-encoding genes, this means 20,500 × 2.2 kb = 45.1 Mb or 1.4% of the genome (about 1.3% coding and 0.1% UTRs).
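The exon arithmetic can be checked with a few lines of Python. This is only a sketch. The genome size of 3,200 Mb is my assumption (it is not stated in the calculation above), and the variable names are just for illustration.

```python
GENOME_MB = 3200    # assumed haploid genome size in Mb (not given in the post)
N_GENES = 20_500    # protein-encoding genes
RESIDUES = 635      # amino acid residues in a typical protein
UTR_BP = 200        # untranslated sequence counted per gene

coding_bp = RESIDUES * 3   # three nucleotides per codon
exon_bp = 2000 + UTR_BP    # coding region rounded up to 2 kb, plus UTR

exon_mb = N_GENES * exon_bp / 1e6
exon_pct = 100 * exon_mb / GENOME_MB

print(f"coding region: {coding_bp} bp")
print(f"essential exons: {exon_mb:.1f} Mb = {exon_pct:.1f}% of the genome")
```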
The minimum size of a eukaryotic intron is less than 50 bp. For a typical mammalian intron, the essential sequences in the introns are: the 5′ splice site (~10 bp); the 3′ splice site (~30 bp); the branch site (~10 bp); and enough additional RNA to form a loop (~30 bp). This gives a total of 80 bp of essential sequence per intron or 20,500 × 7.2 × 80 = 11.8 Mb. Thus, 0.37% of the genome is essential because it contains sequences for processing RNA.
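Here is the same calculation as a short Python sketch, again assuming a 3,200 Mb genome (my assumption, not a number from the post). The dictionary keys are just labels for the four components listed above.

```python
GENOME_MB = 3200          # assumed genome size in Mb
N_GENES = 20_500          # protein-encoding genes
INTRONS_PER_GENE = 7.2    # average from Hong et al. (2006)

# Approximate essential sequence within a typical mammalian intron (bp).
essential_parts = {
    "5' splice site": 10,
    "3' splice site": 30,
    "branch site": 10,
    "loop": 30,
}
essential_bp = sum(essential_parts.values())

intron_essential_mb = N_GENES * INTRONS_PER_GENE * essential_bp / 1e6
intron_essential_pct = 100 * intron_essential_mb / GENOME_MB

print(f"{essential_bp} bp per intron")
print(f"{intron_essential_mb:.1f} Mb = {intron_essential_pct:.2f}% of the genome")
```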
The total of essential sequences in the transcribed part of a gene is about 1.8% of the genome.
The rest of the intron sequence is non-essential junk. Much of it is littered with transposable elements that have inserted haphazardly. If we subtract the essential intron sequence then the average size of the remaining DNA is 3650 bp. The total amount of this sequence is 20,500 × 7.2 × 3650 = 538.7 Mb or 17% of the genome. (Most estimates are somewhat higher.)
Assuming that 44% of this is repetitive transposable elements, this leaves 9.6% of the genome. That's an additional 9.6% of non-essential DNA, or junk, bringing our current total to 55% junk.
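A rough Python check of the intron junk estimate follows. It assumes a 3,200 Mb genome, which is my own round number. With that assumption the last step comes out a little below the 9.6% quoted above. The exact value depends on the genome size used and on where the rounding is done.

```python
GENOME_MB = 3200          # assumed genome size in Mb
N_GENES = 20_500          # protein-encoding genes
INTRONS_PER_GENE = 7.2    # average from Hong et al. (2006)
JUNK_BP_PER_INTRON = 3650 # average intron length minus essential sequence
TE_FRACTION = 0.44        # share already counted as transposable elements

junk_mb = N_GENES * INTRONS_PER_GENE * JUNK_BP_PER_INTRON / 1e6
junk_pct = 100 * junk_mb / GENOME_MB

# Remove the part already counted elsewhere as transposable elements.
non_te_mb = junk_mb * (1 - TE_FRACTION)
non_te_pct = 100 * non_te_mb / GENOME_MB

print(f"non-essential intron DNA: {junk_mb:.1f} Mb = {junk_pct:.0f}% of the genome")
print(f"not counted as transposable elements: {non_te_mb:.1f} Mb = {non_te_pct:.1f}%")
```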
The transcription of every gene is controlled by sequences beyond the 5′ end. There are two classes of sequence: promoters and regulatory sequences. The actual binding sites for RNA polymerase II and various regulatory proteins make up only about 100 bp of essential sequence but the various bound proteins have to form loops of DNA in order to come into contact. It's reasonable to assume that the average gene may need as much as 1000 bp of essential regulatory sequence. (A generous estimate.)
This means 20,500 × 1000 bp = 20.5 Mb or 0.6% of the genome is essential for regulation.
The grand totals for protein-encoding genes are:
essential 2.4%
junk 9.6% (not counting sequences that were included in other calculations)
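The essential total can be reproduced by adding up the three pieces in Mb and converting once at the end. This is a sketch under the same assumed 3,200 Mb genome size; the dictionary labels are mine.

```python
GENOME_MB = 3200          # assumed genome size in Mb
N_GENES = 20_500          # protein-encoding genes
INTRONS_PER_GENE = 7.2    # average from Hong et al. (2006)

# Essential sequence per gene or per intron, in bp, as estimated in the post.
essential_mb = {
    "exons": N_GENES * 2200 / 1e6,
    "intron processing signals": N_GENES * INTRONS_PER_GENE * 80 / 1e6,
    "regulatory sequences": N_GENES * 1000 / 1e6,
}

for label, mb in essential_mb.items():
    print(f"{label}: {mb:.1f} Mb = {100 * mb / GENOME_MB:.2f}%")

total_mb = sum(essential_mb.values())
total_pct = 100 * total_mb / GENOME_MB
print(f"total essential: {total_mb:.1f} Mb = {total_pct:.1f}% of the genome")
```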
Hong X, Scofield DG, Lynch M (2006) Intron size, abundance, and distribution within untranslated regions of genes. Mol. Biol. Evol. 23:2392-404. [PubMed]
Hsp90 and Evolution
The SciPhu blog has an interesting series of posts on the chaperone Hsp90 and its effect on evolution. Here's the list of articles with the links.
My contribution to JustScience 2008 will be a review on a protein with the potential to transform evolution theory as we know it today. The review will be divided into 5 separate blog posts:
I don't agree with the main conclusion that Hsp90 has an important role in evolution but the case is well presented. Anyone who wants to know more about hopeful monsters should read these articles.
UPDATE: My comment above may be misinterpreted. Sciphu may be right about "hopeful monsters" but he's dead wrong to confuse punctuated equilibria and macromutations. He makes the same mistake that others routinely make [Macromutations and Punctuated Equilibria].
We often criticize the creationists for misunderstanding punctuated equilibria and confusing it with the lack of transitional fossils. I think we should be just as hard on evolutionists who make the same mistake.
The figure shows the structures of Hsp90 from yeast, dog, human and E. coli [HSP90 Structure].