Showing posts sorted by relevance for query domains. Sort by date Show all posts
Thursday, March 15, 2007

The Molecular Basis of Roundup® Resistance

Recall that glyphosate inhibits EPSP synthase, an enzyme that catalyzes the following reaction in the chorismate biosynthesis pathway [How Roundup® Works].

Funke et al. (2006) explored the molecular basis of this inhibition by looking at the structure of EPSP synthase from the C4 strain of Agrobacterium sp. This is the resistant form of the enzyme that has been genetically engineered into Roundup Ready® plants [Roundup Ready® Transgenic Plants].

Note that the structure of glyphosate resembles one of the substrates of the reaction; namely phosphoenolpyruvate (PEP). It was already known that glyphosate binds tightly to the active site of the enzyme and inhibits the reaction by preventing PEP binding. As it turns out, the site for glyphosate binding is exactly the same as the site for PEP binding and this explains the inhibition.

Funke et al. (2006) looked at the C4 EPSP enzyme with and without one of the other substrates: namely, shikimate-3-phosphate (sometimes called shikimate-5-phosphate). The results reveal the precise location of the active site of the enzyme at the base of a cleft between two domains. This form of the enzyme is called class II EPSP synthase because it is distantly related to the class I enzymes in other bacteria and eukaryotes (30% amino acid sequence identity). This is the first paper to examine the structure of a class II enzyme.

As an aside, notice that the enzyme closes up a little bit when the substrate binds—sort of like a Pacman icon. This mechanism of substrate binding is called induced fit and it's proving to be more common than most people realized.

The glyphosate resistant (Roundup Ready®) mutation in C4 EPSP synthase is a substitution of Alanine (A) for Glycine (G) at amino acid position 100. The glyphosate molecule fits nicely into the wild type G100 form of the enzyme (lower image) and it excludes PEP binding completely. Note that glyphosate (green) is in an extended configuration when it is bound. The dotted lines represent non-covalent interactions between the enzyme and the glyphosate molecule. The blue dots are "frozen" water molecules embedded in the active site.

In the mutant form of the enzyme the extra methyl group on alanine is just big enough to cause glyphosate to distort so it can no longer lie in the optimal extended configuration (top image). This means that glyphosate binds much more weakly and doesn't inhibit enzyme activity.

The important point is that the active site can still accommodate phosphoenolpyruvate because it is smaller than glyphosate. What this means is that the overall activity of the enzyme in the absence of glyphosate is unaffected. There are lots of EPSP synthase mutants that don't bind glyphosate but in almost all cases the rate of the reaction is drastically reduced because PEP binding is also weakened. For example, if you mutate the glycine to alanine at the equivalent position in other bacterial or plant enzymes you abolish PEP binding along with glyphosate binding.

What's special about the class II enzymes in general and the Agrobacterium sp. enzyme in particular, is that the amino acids surrounding the PEP binding pocket are positioned just right so that a slight shift can exclude glyphosate without affecting phosphoenolpyruvate. This is mostly due to the positions of the charged amino acid side chains that form weak interactions with the oxygen atoms and the nitrogen of glyphosate; for example, arginines (R) at 128, 357, and 405; lysine (K) at 28; and glutamate (E) at 354.

The results of this study not only shed light on the mechanism of glyphosate resistance but they also help explain the lack of Roundup® resistant plants. Apparently, the class I enzymes in plants have a binding pocket that is difficult to mutate in a way that excludes glyphosate while still allowing PEP binding. Nevertheless, some examples of Roundup® resistant plants are known. I'll describe them tomorrow.

(Funke et al. had to do a bit of sleuthing and reconstruction in order to solve the structure of the C4 EPSP synthase. The C4 strain of Agrobacterium sp. has, naturally enough, not been given out to scientists outside of Monsanto laboratories. So Funke et al. got the amino acid sequence from US Patent 5633435 and reverse engineered the nucleotide sequence of the gene. They synthesized the nucleotide sequence and amplified the fragments by PCR. They then tacked on a promoter and a transcription termination signal and cloned the artificial gene into an E. coli plasmid. The artificially reconstructed protein was then expressed in E. coli, isolated, purified, and crystallized.)
Funke, T., Han, H., Healy-Fried, M.L., Fischer, M., and Schonbrunn, E. (2006) Molecular basis for the herbicide resistance of Roundup Ready crops. Proc. Natl. Acad. Sci. (USA) 103:13010-13015. [PubMed]
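
The reverse-engineering step described above, going from a published amino acid sequence to a synthetic gene, can be sketched in a few lines. This is only an illustration and not the procedure Funke et al. actually used: the codon picked for each amino acid in PREFERRED_CODON is an arbitrary assumption (one valid codon per residue from the standard genetic code), the function names are invented for this example, and the peptide is a short hypothetical one, not the C4 EPSP synthase sequence.

```python
# One codon per amino acid, taken from the standard genetic code.
# ASSUMPTION: the choice among synonymous codons is arbitrary here;
# a real gene design would consider codon usage in the host.
PREFERRED_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
    "*": "TAA",  # stop codon
}

# Reverse lookup, limited to the codons chosen above.
CODON_TO_AA = {codon: aa for aa, codon in PREFERRED_CODON.items()}


def back_translate(protein: str) -> str:
    """Return one DNA sequence that encodes `protein` (one-letter codes)."""
    return "".join(PREFERRED_CODON[aa] for aa in protein.upper())


def translate(dna: str) -> str:
    """Translate DNA produced by back_translate() back into protein."""
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(CODON_TO_AA[codon] for codon in codons)


peptide = "MGAK*"  # hypothetical peptide, for illustration only
gene = back_translate(peptide)
print(gene)
print(translate(gene) == peptide)  # round trip recovers the peptide
```

Because the genetic code is degenerate, many different nucleotide sequences encode the same protein, so a reconstructed gene need not match the original gene at the DNA level.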

Tuesday, March 30, 2010

The "Mutationism" Myth I. The Monk's Lost Code and the Great Confusion

This is the second in a series of postings by a guest blogger, Arlin Stoltzfus. You can read the first part at: Introduction to "The Curious Disconnect". Arlin is challenging the status quo in modern evolutionary theory. He's not alone in this challenge but it's important to distinguish between kooks who don't know what they're talking about and serious thinkers who have something to say. Arlin is going to explain to you why everything you thought you knew about mutationism is wrong. I'm happy to give him a chance to post on Sandwalk.

This will be on the exam.



The Curious Disconnect


The Curious Disconnect is the blog of evolutionary biologist Arlin Stoltzfus, available at www.molevol.org/cdblog. An updated version of the post below will be maintained at www.molevol.org/cdblog/mutationism_myth1 (Arlin Stoltzfus, ©2010)

The "Mutationism" Myth I. The Monk's Lost Code and the Great Confusion


Our journey to explore The Curious Disconnect-- the gap between how we think about evolution and how we might think if we were freed from historical baggage-- begins with the Mutationism Myth. In this, the first of four parts, we are not going to confront any tough scientific or conceptual issues. Instead, we are just going to review an odd story about our intellectual history.

The Mutationism Story


While "myth" has the connotation of falsehood, the story that a myth tells isn't necessarily a false one. The mutationism myth, at least, is anchored in historical events.1

The mutationism myth tells the story of how, just over a century ago, the scientific community responded to the discovery of Mendelian genetics by discarding Darwinism, and how Darwinism subsequently was restored. The villains of the story are the influential early geneticists or "Mendelians" who saw genetics as a refutation of Darwinism; the heroes are first, the founders of population genetics, theoreticians who sorted everything out in favor of Darwinism by about 1930, and second, the architects of the Modern Synthesis, activists who popularized and institutionalized what we're calling "Darwinism 2.0".

This story has been re-told in secondary sources for nearly 50 years, though I sense that the frequency is decreasing as this episode passes into ancient history. To find examples, try looking up "mutationism" (sometimes "Mendelism" or even "saltationism") in the index of a book about evolution.

I encourage you to consult whatever sources you have and to share the stories that you find. Note that you won't always be successful. A quick survey of several dozen contemporary books on my shelf reveals that most don't address this episode specifically (a notable absence, in some cases 2); some tell the mutationism myth with varying degrees of panache; and a few provide a historical account rather than a myth. The few historical accounts that I found were in Gould's 2002 The Structure of Evolutionary Theory, Strickberger's 1990 textbook Evolution, and the Wikipedia entry on "Mutationism".

Sample stories


Let's look at a few examples of the mutationism story. Readers who want to check out a freely available online source from the scholarly literature may refer to Ayala and Fitch, 1997 (http://www.ncbi.nlm.nih.gov/pubmed/9223250?dopt=Citation). One example that really caught my eye is not from scientific literature, but from the 2005 obituary for Ernst Mayr in The Economist:

It was not that biologists had given up on evolution by the 1940s -- quite the contrary. But they had got very confused about its mechanism. . . . The geneticists of the early 20th century did not help. They rediscovered the laws of inheritance first developed 40 years earlier by Gregor Mendel, an unsung Moravian monk. They also discovered the idea of genetic mutation. But instead of linking these things to natural selection, they came up with the idea of "saltation" -- in other words, sudden mutational shifts from one well-adapted species to another. Nor, the geneticists complained, had there been enough time for natural selection to do its work, given what they had discovered about the rate at which mutations occur, and the fact that most mutations are deleterious. It was all a bit of a mess. . . . Mr Mayr's advantage over the laboratory-bound biologists who had hijacked and diluted Darwin's legacy was that, like Darwin, he was a naturalist -- and a good one. (anonymous, 2005)

Of course, this is a magazine article, written by anonymous staff writers-- typically one doesn't see such florid language in the scholarly literature. But did the staff writers of the Economist (representing elite opinion) really originate this story, based on their own personal recollections of the 1930's? Of course not. Mayr himself popularized the image of geneticists as laboratory-bound geeks lacking the organic insight of "naturalists". This disdain for the geneticists who "hijacked" Darwin's legacy is readily apparent when evolutionary writers depict geneticists as fools holding "beliefs" that have "obvious inadequacies", unable to understand or "grasp" their own scientific findings:
"It is hard for us to comprehend but, in the early years of this century when the phenomenon of mutation was first named, it was regarded not as a necessary part of Darwinian theory but as an alternative theory of evolution! There was a school of geneticists called the mutationists, which included such famous names as Hugo de Vries and William Bateson among the early rediscoverers of Mendel's principles of heredity, Wilhelm Johannsen the inventor of the word gene, and Thomas Hunt Morgan the father of the chromosome theory of heredity. . . Mendelian genetics was thought of, not as the central plank of Darwinism that it is today, but as antithetical to Darwinism. . . It is extremely hard for the modern mind to respond to this idea with anything but mirth" (Dawkins, 1987, p. 305)

"According to mutationism, random changes in the hereditary material are sufficient for adaptation without much, or any, selection at all. Mutations just somehow happen to be adaptive, the right changes simply manage to occur. The inadequacies of this view are obvious" (Cronin, 1991, p. 47).

"Darwin knew nothing of this [i.e., genetics] but as it turned out, his ignorance was sublimely irrelevant to the problem he was really interested in tackling: evolution. This point was not fully grasped by biologists. Many early geneticists at the dawn of the 20th century, thought their discoveries of the fundamental principles of genetics somehow cast doubt [on], or rendered obsolete, the concept of natural selection. It took several decades of experimentation and theoretical (including mathematical) analysis to show not only that there was no conflict inherent between the emerging results of genetics and the older Darwinian notion of natural selection, but that the two operate in different domains." (Eldredge, 2001, p. 67)

"Mendelian particulate inheritance (today, we call the "particles" genes) was originally identified with De Vries's "mutation theory", according to which new variations or species originated in large jumps, or macromutations, and evolution was exclusively explained by mutation pressure. Darwinian naturalists, believing that Mendelism was synonymous with mutation theory, held on to theories of soft inheritance, while they considered selection a weak force at best. They did not know of the new findings in genetics that would have supported Darwinism." (Segerstråle, 2002)

Notice how, in every version of the story above, the position taken by early geneticists just doesn't make sense. This isn't a story of theory versus theory, it's a story of confusion ultimately yielding to reason.

If de Vries and the other geneticists are playing the role of the pied piper in this story, the "naturalists" are like the children lured away from their Darwinian home. Ultimately the innocents are returned, and order restored, by (oddly enough) mathematicians:

"Between 1918 and 1932 Fisher, Haldane, and Wright showed that Mendelian genetics is consistent with natural selection. Only then, more than 60 years after the publication of The Origin of Species, was the genetic objection to natural selection finally removed. Modern molecular and developmental genetics have confirmed in exquisite chemical detail the key aspects of genetics necessary for Darwin's ideas to work: that the genetic material is DNA, that DNA has a sequence, . . . mutates . . . contains information . . " (p. 16 of Stearns and Hoekstra, 2005)

Anatomy of a Myth


In a subsequent post, we will look at original sources to see what the "mutationists" actually believed, and why. And eventually we will integrate this into the bigger picture of how evolutionary theory developed. But for now, let's just summarize the pattern that is apparent in the literature.

First, the mutationism story is clearly a story or myth, and not an ordinary scientific truth claim. We can see this because the story-tellers are not using ordinary scientific conventions to convince us that the story is true. If you or I were making an ordinary scientific argument (for instance) for an effect of "translational selection" on codon usage, we would mention a correlation between codon frequencies and the abundance of corresponding tRNAs, citing the classic work of Ikemura (1981), and we might even repeat a figure showing this correlation, to impress this point upon the minds of readers (e.g., just as in Ch. 7 of Freeman & Herron, 1998).

When I see instances of the mutationism story, typically I don't find quotations illustrating what the mutationists believed, nor facts & figures to refute their views, but only vague attributions and generalized claims. Apropos, the following quotation from Ernst Mayr never fails to make me laugh:

The genetic work of the last four decades has refuted mutationism (saltationism) so thoroughly that it is not necessary to repeat once more all the genetic evidence against it. (Mayr, 1960)

And the puissant Dr. Mayr proceeds on, not boring the reader with any tiresome "genetic evidence", nor citing sources that might allow the reader to evaluate the truth of his statement. It's a story, after all.

By contrast, the 3 sources that I mentioned above as providing scientific history, rather than myth, all make reference to specific experimental and theoretical results, and reveal knowledge of specific historically important scientific works. For instance, Strickberger's reference list includes Johannsen, 1903, as well as the 1902 paper by Yule that reconciled Mendelian genetics with quantitative variation (in neo-Darwinian mythology, credit for Yule's work is given to little Ronny Fisher, who was 11 at the time).

Second, every story has a plot or "action", and the main action of the mutationism story is a turn of fate in which power is temporarily in the hands of the wrong people or ideas. In archetypal terms, it's a story of usurpation and restoration: the throne is usurped, and the kingdom falls into darkness and confusion until the throne is restored to the king's rightful heirs. The mutationism episode didn't have to be told that way: it might have been presented as a period of reform (in which old ideas were abandoned) or discovery (when new territory was mapped out). Instead, it's presented as a mistake, an interlude of confusion, a collective delusion.

Indeed, another way to look at the mythic action is that the Mendelians are wizards or false prophets who place the kingdom under a spell, leading folks astray and causing them to believe things that they just shouldn't have believed.

What delusional spell did the Mendelians cast? In the story by Eldredge, or by Stearns & Hoekstra above, the spell is that Mendelian genetics is inconsistent with "the concept of natural selection" (Eldredge). In the story told by Segerstråle, Cronin, Mayr and The Economist, the delusional spell is a bit different: the principle of selection is irrelevant because mutational jumps alone explain evolution.

Third, the key to restoring Darwin's kingdom was to add the missing piece of genetics. Ultimately, after the period of darkness ended, the discovery of genetics "provided the missing link in Darwin's theory" (Segerstråle, 2002), or "The missing link in Darwin's argument was provided by Mendelian genetics" (Ayala & Fitch, 1997). Darwinism was restored, not by taking away the power of genetics, but by redirecting it to support Darwinism. Clearly, genetics is the key to ruling the kingdom, like the One Ring that Rules them All in Tolkien's world. The ones who have the ring have the power.

The story is made more fascinating by the fact that the key to power is literally a code of rules developed by a monk that remained lost for nearly half a century. The usurpers who discover The Monk's Code misinterpret it, and use it to overthrow the true king, establishing a reign of error. But when The Founders decipher the true meaning of the Monk's Code, The Architects campaign throughout the kingdom, spreading the news: the Monk's Code proves that Darwin is the true king. Darwin's rule is re-established, all opposition ceases, and the kingdom is unified.

Homework


If you would like to contribute a mutationism story, I would be happy to start a collection if you make it easy for me by providing a complete and well-formed text item. Be sure to provide a quoted passage with a source, citing exact page numbers. If we get enough stories, let's try to recruit a sociologist or historian to study this further.

Summary


To summarize, the mutationism story is a myth that is retold in secondary sources. The basic story is simple: the discoverers of genetics misinterpreted their discovery, thinking it incompatible with Darwinism; Darwinism went into disfavor; population geneticists came along and showed that genetics was the missing key to Darwinism; Darwinism was restored and once again reigned supreme.

Next time on The Curious Disconnect, we'll start pulling on some of the loose threads of this story.

For now, note how the writers quoted above are genuinely baffled by our scientific history. It just doesn't make sense to them. A century ago, most of an entire generation of scientists thought of genetics as a contradiction of Darwinism. This is a historical fact, and presumably it has an explanation that rational folks can understand by examining what scientists of the time wrote. But this historical fact mystifies Dawkins, Eldredge, Cronin, and others.

References

Anonymous. 2005. Ernst Mayr, evolutionary biologist, died on February 3rd, aged 100. The Economist, February.

Ayala, F. J., and W. M. Fitch. 1997. Genetics and the origin of species: an introduction. Proc Natl Acad Sci U S A 94:7691-7697.

Cronin, H. 1991. The Ant and the Peacock. Cambridge University Press, Cambridge.

Dawkins, R. 1987. The Blind Watchmaker. W.W. Norton and Company, New York.

Eldredge, N. 2001. The Triumph of Evolution and the Failure of Creationism. W H Freeman & Co.

Freeman, S., and J. C. Herron. 1998. Evolutionary Analysis. Prentice-Hall, Upper Saddle River, New Jersey.

Gould, S. J. 2002. The Structure of Evolutionary Theory. Harvard University Press, Cambridge, Massachusetts.

Ikemura, T. 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151:389-409.

Mayr, E. 1960. The Emergence of Evolutionary Novelties. Pp. 349-380 in S. Tax, and C. Callender, eds. Evolution After Darwin: The University of Chicago Centennial. University of Chicago Press, Chicago.

Segerstråle, U. 2002. Neo-Darwinism. Pp. 807-810 in M. Pagel, ed. Encyclopedia of Evolution. Oxford University Press, New York.

Stearns, S. C., and R. F. Hoekstra. 2005. Evolution: an introduction. Oxford University Press, New York.

Strickberger, M.W. 1990. Evolution (1st edition).

Notes
1 The defining characteristic of a myth is not that it isn't literally true, but that it isn't told for reason of being literally true, but for reason of being meaningful or poignant: a myth is a story with a cultural value, not necessarily a literal-truth value. The connection between myths and untruths, then, has to do with discoverability: when we find a pattern P = { X people are repeating story Y }, where X is a large number, this pattern by itself does not prove that Y is a myth because X people might have all discovered or verified Y independently; but if Y has diverse elements that are untrue (or unverifiable), then we can conclude that its repetition does not signify independent verification, suggesting that it's a myth.



2 The Oxford Encyclopedia of Evolution does not have an article on mutationism; the article on Morgan says nothing of his views on evolution; there is no article on Bateson; mutationism is only addressed peripherally in Hull's article on the history of evolutionary theory; it is mainly addressed in Segerstråle's article on neo-Darwinism.



Monday, April 30, 2007

Herbert Tabor/Journal of Biological Chemistry Lectureship

 
One of the big events for ASBMB is the Herbert Tabor JBC lecture. It was held Saturday night in one of the large ballrooms. There were about one thousand people attending.

The first lecture was by Tony Hunter from The Salk Institute in California (USA). He spoke about mammalian kinases and phosphorylases with an emphasis on tyrosine kinases, which he discovered back in 1979. Tyrosine kinases are enzymes that attach phosphate groups to tyrosine residues in proteins. They are important because the phosphorylation and dephosphorylation of enzymes regulates their activity. Many of the genes that cause cancer (oncogenes) encode tyrosine kinases.

Hunter is trying to find out how many different protein kinases there are in humans. The latest count suggests about 900 different enzymes. This is a remarkable number when you think about it. It means that 3-4% of all genes in our genome encode kinases.
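
The 3-4% figure is easy to check. Here is a rough sketch, assuming total human gene counts of about 21,000, 25,000, and 30,000 (estimates that have been quoted at various times; which total applies is an assumption, and the function name is invented for this example):

```python
KINASE_GENES = 900  # approximate count quoted in the lecture


def kinase_fraction(total_genes: int) -> float:
    """Fraction of all genes that encode kinases, for a given gene total."""
    return KINASE_GENES / total_genes


# ASSUMPTION: these totals are rough estimates of the human gene count.
for total_genes in (21_000, 25_000, 30_000):
    print(f"{total_genes:>6} genes -> {kinase_fraction(total_genes):.1%} kinases")
```

With any of these totals the fraction lands between 3% and a little over 4%, which is consistent with the range given above.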

The second award winner was Tony Pawson from the Samuel Lunenfeld Research Institute and the University of Toronto (Ontario, Canada). I've heard Tony speak many times so I wasn't quite as attentive during his lecture. Tony discovered a number of protein domains, notably the SH2 domain, that interact with tyrosine kinases and their target proteins. The work of the two Tonys is complementary and that's why they received this joint award.

UPDATE: I forgot to mention that there was a reception after the talks. Lots of delicious munchies and an open bar. I had a beer (or two). Most biochemists drink wine or fruit juice. It was not a wild bunch.

Friday, November 01, 2013

Vertebrate Complexity Is Explained by the Evolution of Long-Range Interactions that Regulate Transcription?

The Deflated Ego Problem is a very serious problem in molecular biology. It refers to the fact that many molecular biologists were puzzled and upset to learn that humans have about the same number of genes as all other multicellular eukaryotes. The "problem" is often introduced by stating that the experts working on the human genome project expected at least 100,000 genes but were "shocked" when the first draft of the human genome showed only 30,000 genes (now down to about 25,000). This story is a myth as I document in: Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome. Truth is, most knowledgeable experts expected that humans would have about the same number of genes as other animals. They realized that the differences between fruit flies and humans, for example, didn't depend on a host of new human genes but on the timing and expression of a mostly common set of genes.

This isn't good enough for many human chauvinists. They are still looking for something special that sets humans apart from all other animals. I listed seven possibilities in my post on the deflated ego problem:

Thursday, February 07, 2008

Theme: Genomes & Junk DNA

Junk in Your Genome

Transposable Elements: (44% junk)

      DNA transposons:
         active (functional): <0.1%
         defective (nonfunctional): 3%
      retrotransposons:
         active (functional): <0.1%
         defective transposons
            (full-length, nonfunctional): 8%
            L1 LINES (fragments, nonfunctional): 16%
            other LINES: 4%
            SINES (small pseudogene fragments): 13%
            co-opted transposons/fragments: <0.1% a
a Co-opted transposons and transposon fragments are those that have secondarily acquired a new function.
Viruses (9% junk)

      DNA viruses
         active (functional): <0.1%
         defective DNA viruses: ~1%
      RNA viruses
         active (functional): <0.1%
         defective (nonfunctional): 8%
         co-opted RNA viruses: <0.1% b
b Co-opted RNA viruses are defective integrated virus genomes that have secondarily acquired a new function.
Pseudogenes (1.2% junk)
      (from protein-encoding genes): 1.2% junk
      co-opted pseudogenes: <0.1% c
c Co-opted pseudogenes are formerly defective pseudogenes that have secondarily acquired a new function.
Ribosomal RNA genes:
      essential 0.22%
      junk 0.19%

Other RNA encoding genes
      tRNA genes: <0.1% (essential)
      known small RNA genes: <0.1% (essential)
      putative regulatory RNAs: ~2% (essential)

Protein-encoding genes: (9.6% junk)
      transcribed region:  
            essential 1.8%  
            intron junk (not included above) 9.6% d
d Intron sequences account for about 30% of the genome. Most of these sequences qualify as junk but they are littered with defective transposable elements that are already included in the calculation of junk DNA.
Regulatory sequences:
      essential 0.6%

Origins of DNA replication
      <0.1% (essential)

Scaffold attachment regions (SARS)
      <0.1% (essential)

Highly Repetitive DNA (1% junk)
      α-satellite DNA (centromeres)
            essential 2.0%
            non-essential 1.0%
      telomeres
            essential (less than 1000 kb, insignificant)

Intergenic DNA (not included above)
      conserved 2% (essential)
      non-conserved 26.3% (unknown but probably junk)

Total Essential/Functional (so far) = 8.7%
Total Junk (so far) = 65%
Unknown (probably mostly junk) = 26.3%
For references and further information click on the "Genomes & Junk DNA" link in the box

LAST UPDATE: May 10, 2011 (fixed totals, and ribosomal RNA calculations)
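
The totals above can be checked by adding up the category figures. This is a quick sketch that copies the percentages from the list; entries given only as "<0.1%" are ignored, which is an assumption that slightly understates the essential total, and the variable names are invented for this example.

```python
# Junk percentages by category, copied from the list above.
junk = {
    "transposable elements": 44.0,
    "viruses": 9.0,
    "pseudogenes": 1.2,
    "ribosomal RNA genes (junk copies)": 0.19,
    "introns (not counted elsewhere)": 9.6,
    "highly repetitive DNA": 1.0,
}

# Essential percentages, ignoring every "<0.1%" entry (ASSUMPTION).
essential = {
    "ribosomal RNA genes": 0.22,
    "putative regulatory RNAs": 2.0,
    "protein-encoding transcribed regions": 1.8,
    "regulatory sequences": 0.6,
    "alpha-satellite DNA": 2.0,
    "conserved intergenic DNA": 2.0,
}

# Defective transposon classes that should add up to the 44% figure:
# DNA transposons, full-length retrotransposons, L1 LINES, other LINES, SINES.
transposon_parts = [3, 8, 16, 4, 13]

total_junk = sum(junk.values())
total_essential = sum(essential.values())
unknown = 26.3  # non-conserved intergenic DNA

print(f"transposable elements: {sum(transposon_parts)}%")
print(f"junk:      {total_junk:.2f}%")
print(f"essential: {total_essential:.2f}% (before the <0.1% entries)")
print(f"all three: {total_junk + total_essential + unknown:.2f}%")
```

The small gap between the summed essential figure and the stated 8.7% is what you would expect from the handful of "<0.1%" entries that this sketch leaves out.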





November 11, 2006
Sea Urchin Genome Sequenced

The sea urchin genome is 814,000 kb or about 1/4 the size of a typical mammalian genome. Like mammalian genomes, the sea urchin genome contains a lot of junk DNA, especially repetitive DNA. The preliminary count of the number of genes is 23,300. This is about the same number that we have in our genomes. Only about 10,000 of these genes have been annotated by the sea urchin sequencing team.

Monday, January 04, 2016

Answering two questions from Vincent Torley

Vincent Torley read a post by Jerry Coyne where Jerry wondered if Intelligent Design Creationism was in trouble because the Discovery Institute has lost Bill Dembski and Casey Luskin [Is the Discovery Institute falling apart?].

Torley disagrees, obviously, but he focuses on a couple of the scientific statements in Jerry Coyne's post and comes up with Two quick questions for Professor Coyne.

I hope Professor Coyne won't mind if I answer.

Before answering, let's take note of the fact that Vincent Torley has been convinced by the evidence that most of our genome is junk. I wonder how that will go over in the ID community?

Here's question #1 ...

Tuesday, June 19, 2007

What is a gene, post-ENCODE?

Back in January we had a discussion about the definition of a gene [What is a gene?]. At that time I presented my personal preference for the best definition of a gene.
A gene is a DNA sequence that is transcribed to produce a functional product.
This is a definition that's widely shared among biochemists and molecular biologists but there are competing definitions.

Now, there's a new kid on the block. The recent publication of a slew of papers from the ENCODE project has prompted many of the people involved to proclaim that a revolution is under way. Part of the revolution includes redefining a gene. I'd like to discuss the paper by Mark Gerstein et al. (2007) [What is a gene, post-ENCODE? History and updated definition] to see what this revolution is all about.

The ENCODE project is a large scale attempt to analyze and annotate the human genome. The first results focus on about 1% of the genome spread out over 44 segments. These results have been summarized in an extraordinarily complex Nature paper with massive amounts of supplementary material (The Encode Project Consortium, 2007). The Nature paper is supported by dozens of other papers in various journals. Ryan Gregory has a list of blog references to these papers at ENCODE links.

I haven't yet digested the published results. I suspect that like most bloggers there's just too much there to comment on without investing a great deal of time and effort. I'm going to give it a try but it will require a lot of introductory material, beginning with the concept of alternative splicing, which is this week's theme.

The most widely publicized result is that most of the human genome is transcribed. It might be more correct to say that the ENCODE Project detected RNAs that are either complementary to much of the human genome or lead to the inference that much of it is transcribed.

This is not news. We've known about this kind of data for 15 years and it's one of the reasons why many scientists over-estimated the number of human genes in the decade leading up to the publication of the human genome sequence. The importance of the ENCODE project is that a significant fraction of the human genome has been analyzed in detail (1%) and that the group made some serious attempts to find out whether the transcripts really represent functional RNAs.

My initial impression is that they have failed to demonstrate that the rare transcripts of junk DNA are anything other than artifacts or accidents. It's still an open question as far as I'm concerned.

It's not an open question as far as the members of the ENCODE Project are concerned and that brings us to the new definition of a gene. Here's how Gerstein et al. (2007) define the problem.
The ENCODE consortium recently completed its characterization of 1% of the human genome by various high-throughput experimental and computational techniques designed to characterize functional elements (The ENCODE Project Consortium 2007). This project represents a major milestone in the characterization of the human genome, and the current findings show a striking picture of complex molecular activity. While the landmark human genome sequencing surprised many with the small number (relative to simpler organisms) of protein-coding genes that sequence annotators could identify (~21,000, according to the latest estimate [see www.ensembl.org]), ENCODE highlighted the number and complexity of the RNA transcripts that the genome produces. In this regard, ENCODE has changed our view of "what is a gene" considerably more than the sequencing of the Haemophilus influenza and human genomes did (Fleischmann et al. 1995; Lander et al. 2001; Venter et al. 2001). The discrepancy between our previous protein-centric view of the gene and one that is revealed by the extensive transcriptional activity of the genome prompts us to reconsider now what a gene is.
Keep in mind that I personally reject the premise and I don't think I'm alone. As far as I'm concerned, the "extensive transcriptional activity" could be artifact and I haven't had a "protein-centric" view of a gene since I learned about tRNA and ribosomal RNA genes as an undergraduate in 1967. Even if the ENCODE results are correct my preferred definition of a gene is not threatened. So, what's the fuss all about?

Regulatory Sequences
Gerstein et al. are worried because many definitions of a gene include regulatory sequences. Their results suggest that many genes have multiple large regions that control transcription and these may be located at some distance from the transcription start site. This isn't a problem if regulatory sequences are not part of the gene, as in the definition quoted above (a gene is a transcribed region). As a matter of fact, the fuzziness of control regions is one reason why most modern definitions of a gene don't include them.
Overlapping Genes
According to Gerstein et al.
As genes, mRNAs, and eventually complete genomes were sequenced, the simple operon model turned out to be applicable only to genes of prokaryotes and their phages. Eukaryotes were different in many respects, including genetic organization and information flow. The model of genes as hereditary units that are nonoverlapping and continuous was shown to be incorrect by the precise mapping of the coding sequences of genes. In fact, some genes have been found to overlap one another, sharing the same DNA sequence in a different reading frame or on the opposite strand. The discontinuous structure of genes potentially allows one gene to be completely contained inside another one’s intron, or one gene to overlap with another on the same strand without sharing any exons or regulatory elements.
We've known about overlapping genes ever since the sequences of the first bacterial operons and the first phage genomes were published. We've known about all the other problems for 20 years. There's nothing new here. No definition of a gene is perfect—all of them have exceptions that are difficult to squeeze into a one-size-fits-all definition of a gene. The problem with the ENCODE data is not that they've just discovered overlapping genes, it's that their data suggests that overlapping genes in the human genome are more the rule than the exception. We need more information before accepting this conclusion and redefining the concept of a gene based on analysis of the human genome.
Splicing
Splicing was discovered in 1977 (Berget et al. 1977; Chow et al. 1977; Gelinas and Roberts 1977). It soon became clear that the gene was not a simple unit of heredity or function, but rather a series of exons, coding for, in some cases, discrete protein domains, and separated by long noncoding stretches called introns. With alternative splicing, one genetic locus could code for multiple different mRNA transcripts. This discovery complicated the concept of the gene radically.
Perhaps back in 1978 the discovery of splicing prompted a re-evaluation of the concept of a gene. That was almost 30 years ago and we've moved on. Now, many of us think of a gene as a region of DNA that's transcribed and this includes exons and introns. In fact, the modern definition doesn't have anything to do with proteins.

Alternative splicing does present a problem if you want a rigorous definition with no fuzziness. But biology isn't like that. It's messy and you can't get rid of fuzziness. I think of a gene as the region of DNA that includes the longest transcript. Genes can produce multiple protein products by alternative splicing. (The fact that the definition above says "a" functional product shouldn't mislead anyone. That was not meant to exclude multiple products.)

The real problem here is that the ENCODE project predicts that alternative splicing is abundant and complex. They claim to have discovered many examples of splice variants that include exons from adjacent genes as shown in the figure from their paper. Each of the lines below the genome represents a different kind of transcript. You can see that there are many transcripts that include exons from "gene 1" and "gene 2" and another that includes exons from "gene 1" and "gene 4." The combinations and permutations are extraordinarily complex.

If this represents the true picture of gene expression in the human genome, then it would require a radical rethinking of what we know about molecular biology and evolution. On the other hand, if it's mostly artifact then there's no revolution under way. The issue has been fought out in the scientific literature over the past 20 years and it hasn't been resolved to anyone's satisfaction. As far as I'm concerned the data overwhelmingly suggests that very little of that complexity is real. Alternative splicing exists but not the kind of alternative splicing shown in the figure. In my opinion, that kind of complexity is mostly an artifact due to spurious transcription and splicing errors.
Trans-splicing
Trans-splicing refers to a phenomenon where the transcript from one part of the genome is attached to the transcript from another part of the genome. The phenomenon has been known for over 20 years—it's especially common in C. elegans. It's another exception to the rule. No simple definition of a gene can handle it.
Parasitic and mobile genes
This refers mostly to transposons. Gerstein et al. say, "Transposons have altered our view of the gene by demonstrating that a gene is not fixed in its location." This isn't true. Nobody has claimed that the location of genes is fixed.
The large amount of "junk DNA" under selection
If a large amount of what we now think of as junk DNA turns out to be transcribed to produce functional RNA (or proteins) then that will be a genuine surprise to some of us. It won't change the definition of a gene as far as I can see.
The paper goes on for many more pages but the essential points are covered above. What's the bottom line? The new definition of an ENCODE gene is:
There are three aspects to the definition that we will list below, before providing the succinct definition:
  1. A gene is a genomic sequence (DNA or RNA) directly encoding functional product molecules, either RNA or protein.
  2. In the case that there are several functional products sharing overlapping regions, one takes the union of all overlapping genomic sequences coding for them.
  3. This union must be coherent—i.e., done separately for final protein and RNA products—but does not require that all products necessarily share a common subsequence.
This can be concisely summarized as:
The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.
On the surface this doesn't seem to be much different from the definition of a gene as a transcribed region but there are subtle differences. The authors describe how their new definition works using a hypothetical example.

How the proposed definition of the gene can be applied to a sample case. A genomic region produces three primary transcripts. After alternative splicing, products of two of these encode five protein products, while the third encodes for a noncoding RNA (ncRNA) product. The protein products are encoded by three clusters of DNA sequence segments (A, B, and C; D; and E). In the case of the three-segment cluster (A, B, C), each DNA sequence segment is shared by at least two of the products. Two primary transcripts share a 5' untranslated region, but their translated regions D and E do not overlap. There is also one noncoding RNA product, and because its sequence is of RNA, not protein, the fact that it shares its genomic sequences (X and Y) with the protein-coding genomic segments A and E does not make it a co-product of these protein-coding genes. In summary, there are four genes in this region, and they are the sets of sequences shown inside the orange dashed lines: Gene 1 consists of the sequence segments A, B, and C; gene 2 consists of D; gene 3 of E; and gene 4 of X and Y. In the diagram, for clarity, the exonic and protein sequences A and E have been lined up vertically, so the dashed lines for the spliced transcripts and functional products indicate connectivity between the proteins sequences (ovals) and RNA sequences (boxes). (Solid boxes on transcripts) Untranslated sequences, (open boxes) translated sequences.
This isn't much different from my preferred definition except that I would have called the region containing exons C and D a single gene with two different protein products. Gerstein et al (2007) split it into two different genes.
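
To make the three-part rule concrete, here is a minimal Python sketch of how one might apply it to the hypothetical region in the figure. This is my own illustration, not code from Gerstein et al. The coordinates, the product names (P1 to P5 and ncRNA), the way segments A, B, and C are distributed among the first three products, and the function names are all assumptions made up for the example.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if two half-open genomic intervals (start, end) share any base."""
    return a[0] < b[1] and b[0] < a[1]

def products_overlap(segs_a, segs_b):
    """True if any segment of one product overlaps any segment of the other."""
    return any(overlaps(s, t) for s in segs_a for t in segs_b)

def call_genes(products):
    """Group products into genes following the three-part rule.

    products: list of (name, kind, segments), where kind is "protein" or
    "rna" and segments is a set of (start, end) intervals.
    Products are only merged with products of the same kind (rule 3), and a
    gene is the union of the segments of all products that overlap (rule 2).
    """
    by_kind = defaultdict(list)
    for name, kind, segs in products:
        by_kind[kind].append((name, segs))

    genes = []
    for kind, items in by_kind.items():
        clusters = []  # each cluster is (set of product names, set of segments)
        for name, segs in items:
            names, union = {name}, set(segs)
            untouched = []
            for c_names, c_segs in clusters:
                if products_overlap(segs, c_segs):
                    names |= c_names   # merge this cluster into the new one
                    union |= c_segs
                else:
                    untouched.append((c_names, c_segs))
            clusters = untouched + [(names, union)]
        for names, union in clusters:
            genes.append((kind, sorted(names), sorted(union)))
    genes.sort(key=lambda g: (g[0], g[2]))
    return genes

# Hypothetical coordinates for the segments named in the figure caption.
A, B, C = (100, 200), (300, 400), (500, 600)
D, E = (700, 800), (900, 1000)
X, Y = (120, 180), (920, 980)   # ncRNA segments lying inside A and E

EXAMPLE = [
    ("P1", "protein", {A, B}),
    ("P2", "protein", {B, C}),
    ("P3", "protein", {A, C}),
    ("P4", "protein", {D}),
    ("P5", "protein", {E}),
    ("ncRNA", "rna", {X, Y}),
]

genes = call_genes(EXAMPLE)
for kind, names, segs in genes:
    print(kind, names, segs)
```

On this toy region the sketch returns four genes: one protein gene covering A, B, and C, one for D, one for E, and a separate RNA gene for X and Y, even though X and Y sit inside A and E. Note that the sketch only looks at the final products, so the shared 5' untranslated region of the D and E transcripts plays no role, which is exactly why D and E come out as two genes.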

The bottom line is that in spite of all the rhetoric the "new" definition of a gene isn't much different from the old one that some of us have been using for a couple of decades. It's different from some old definitions that other scientists still prefer but this isn't revolutionary. That discussion has already been going on since 1980.

Let me close by making one further point. The "data" produced by the ENCODE consortium is intriguing but it would be a big mistake to conclude that everything they say is a proven fact. Skepticism about the relevance of those extra transcripts is quite justified as is skepticism about the frequency of alternative splicing.


Gerstein, M.B., Bruce, C., Rozowsky, J.S., Zheng, D., Du, J., Korbel, J.O., Emanuelsson, O., Zhang, Z.D., Weissman, S. and Snyder, M. (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res. 17:669-681.

The ENCODE Project Consortium (2007) Nature 447:799-816. [PDF]

[Hat Tip: Michael White at Adaptive Complexity]

Wednesday, June 09, 2021

Let's analyze the Newsweek lab leak conspiracy theory article

Lots of people have been sucked in to the lab leak conspiracy theory based on reporting in newspapers and magazines. One of the widely-cited sources is an article published in Newsweek on June 2, 2021. The focus of the article is on How Amateur Sleuths Broke the Wuhan Lab Story and Embarrassed the Media. Those "amateur sleuths" go by the name "Decentralized Radical Autonomous Search Team Investigating COVID-19" or DRASTIC. I'm not interested in them; I'm interested in scientific facts so let's look at all of the so-called "facts" in the Newsweek article. I'll leave it up to you, dear reader, to judge whether the media should be embarrassed by this story.

Newsweek statement #1: Thanks to DRASTIC, we now know that the Wuhan Institute of Virology had an extensive collection of coronaviruses gathered over many years of foraging in the bat caves, and that many of them—including the closest known relative to the pandemic virus, SARS-CoV-2—came from a mineshaft where three men died from a suspected SARS-like disease in 2012.

Some of this is correct. The WIV scientists and their collaborators have been collecting samples from bats all over China and Indochina for several years and many of them have been examined for the presence of coronaviruses. WIV scientists routinely sampled bats from the Yunnan mine cave from 2012 to 2015 after they were informed that four people had been admitted to hospital with severe respiratory disease in 2012 (one of them died). The workers tested negative for Ebola, Nipah virus, and coronavirus so the scientists were looking for a likely unknown virus that caused the infection. (The serum samples were subsequently tested for SARS-CoV-2 and they were negative.)

Several coronaviruses were detected in the bat samples based on short PCR sequences (370 bp) from the RdRp gene and they were classified as either alphacoronaviruses or betacoronaviruses. The data was published in 2016 (Ge et al., 2016) and the sequences were deposited in GenBank in 2016. Improvements in sequencing technology in 2018 prompted a re-examination of those bat samples and an almost full-length sequence of a betacoronavirus was obtained (missing the 5′ and 3′ ends). This virus was named RaTG13 and one of the short GenBank sequences identified as BtCoV/4491 (Accession #KP876546) comes from that virus (Zhou et al., 2020 Addendum).

The bat virus is RaTG13 and it is 96% similar in sequence to SARS-CoV-2—that means that they probably shared a common ancestor about 50 years ago (Zhou et al. 2020). The sequence was deposited in GenBank as Accession #MN996532. There are parts of SARS-CoV-2 that are not closely related to RaTG13 and this includes the spike protein gene, which is essential for infecting humans. The spike gene sequence is most closely related to a coronavirus from pangolins, Pangolin-CoV.

The data is consistent with a recombination event between different strains of coronaviruses giving rise to SARS-CoV-2 or its immediate ancestors. Such recombinations are a common feature of coronavirus propagation in various animals, including bats. What's clear is that none of the currently known coronavirus sequences could possibly be the ancestors of SARS-CoV-2 so the hunt is on to locate those viruses.

Recently, the scientists at WIV and their collaborators at the University of Chinese Academy of Sciences in Beijing looked at some of the other samples from bat anal swabs collected in Yunnan in 2015. This in-depth analysis was prompted by the discovery of SARS-CoV-2 and the pandemic. They found a number of other bat coronavirus sequences and some of them were more closely related to SARS-CoV-2 in the ORF1b regions but not in other parts of the genome. Again, this is consistent with frequent recombination events that have been documented over the past few decades. Surprisingly, some of these new bat coronaviruses were able to use the bat angiotensin-converting enzyme 2 (ACE2) as a receptor, but they did not bind to human ACE2. (These assays take a lot of time and effort.) This and other data show that the evolution of ACE2 binding can occur in bats giving rise to a generalist virus, SARS-CoV-2, that can bind to ACE2 from many different species (MacLean et al., 2021; Guo et al., 2021).

A group of scientists from France, United States, Vietnam, and Cambodia looked at bat samples that were collected in Cambodia in 2010 and found coronaviruses from another species of bats that were closely related to SARS-CoV-2 across most of the genome except for a small region of the spike protein gene. In some parts of the genome (ORF1a and ORF8) these viruses were more closely related to SARS-CoV-2 than RaTG13 (Hu et al. 2021). The evolutionary history of the Cambodian viruses indicates that they are mosaic viruses due to recombination events. This data indicates that SARS-CoV-2 related viruses are found in Southeast Asia as well as China—that's significant since pangolins are only found in Southeast Asia and not in China.
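
The mosaic pattern described here is usually seen by comparing sequences window by window rather than across the whole genome at once. Here is a minimal Python sketch of that idea, assuming the sequences are already aligned and of equal length. The three short strings are made-up toy data, not real viral sequences, and the function names are my own.

```python
def percent_identity(a, b):
    """Percent of aligned positions at which two equal-length sequences match."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def closest_by_window(query, relatives, window, step):
    """For each window along the alignment, report the most similar relative.

    Returns a list of (window start, name of closest relative, percent identity).
    """
    results = []
    for start in range(0, len(query) - window + 1, step):
        end = start + window
        scores = {
            name: percent_identity(query[start:end], seq[start:end])
            for name, seq in relatives.items()
        }
        best = max(scores, key=scores.get)
        results.append((start, best, scores[best]))
    return results

# Toy aligned sequences (hypothetical, for illustration only).
QUERY = "AAAAAAAAAACCCCCCCCCC"
RELATIVES = {
    "relative_1": "AAAAAAAAAACCCCCGGGGG",  # matches the query in the first half
    "relative_2": "AAAAATTTTTCCCCCCCCCC",  # matches the query in the second half
}

windows = closest_by_window(QUERY, RELATIVES, window=10, step=10)
for start, name, identity in windows:
    print(start, name, identity)
```

Both toy relatives are 75% identical to the query overall, yet each one is the closest match in a different half of the sequence. That is the kind of signal that points to recombination, and it's why no single "closest relative" tells the whole story.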

SARS-CoV-2-like viruses have also been found in Thailand (Wacharapluesadee et al., 2021).

A group centered in Taian, China, has recently examined coronaviruses from bats at the botanical garden in Mengla county in Yunnan. They have identified four additional SARS-CoV-2 related viruses including one, RpYN06, that is the closest relative to SARS-CoV-2 outside of the spike gene. This is now the leading candidate for the "backbone" that might have given rise to the pandemic virus (Zhou et al., 2021).

CONCLUSION: The Newsweek statement is not wrong but it is highly misleading. The WIV labs had bat samples that contained coronaviruses but so did lots of labs all over the world. In that sense, these labs have an "extensive collection of coronaviruses" but they are stored in bat poop at -80° C! They identified two coronaviruses, RaTG13 and RmYN02, by sequencing PCR fragments but the sequences were not complete. It's misleading for Newsweek to imply that the WIV labs had an RaTG13 coronavirus in their labs because that implies that they were working with active viruses. It's true that the RaTG13 virus came from a place where several workers had gotten sick with respiratory disease a few years before the sample was collected. One of these men died (not three) but none of the patients tested positive for coronavirus.

Newsweek statement #2: We know that the WIV was actively working with these viruses, using inadequate safety protocols, in ways that could have triggered the pandemic, and that the lab and Chinese authorities have gone to great lengths to conceal these activities.

CONCLUSION: This is misleading. As far as I know, the scientists at WIV were not actively working with the RaTG13 virus because they had never isolated that virus. Furthermore, it's almost impossible to create SARS-CoV-2 from RaTG13 [Could scientists use the bat coronavirus RaTG13 to engineer SARS-CoV-2, the virus that causes COVID-19, in a lab?]. They were working with other bat coronaviruses but none of them were closely related to SARS-CoV-2 so it's extremely misleading to imply that the escape of these viruses could have triggered the pandemic. They were not using inadequate safety protocols because all of the work with bat coronaviruses was carried out in level 2 labs, exactly as required. There's no evidence that the scientists at the WIV labs have concealed anything. You can only accuse someone of concealing something if you have strong evidence that they did something that they deny doing.

Newsweek statement #3: We know that the first cases appeared weeks before the outbreak at the Huanan wet market that was once thought to be ground zero.

CONCLUSION: This is correct. Chinese scientists and health workers identified a number of earlier cases that appear to be unrelated to the seafood market and they published their results in scientific journals over a year ago. They now conclude that the virus was circulating in the Wuhan population for more than a month before the superspreader event at the market ignited the pandemic. This appears to be a case where Newsweek trusts the work of Chinese scientists.

Newsweek statement #4: The Newsweek article talks a lot about the DRASTIC group as though they have uncovered a huge conspiracy theory. One of their "discoveries" relates to the bat coronavirus RaTG13 that's first mentioned in the paper where the SARS-CoV-2 sequence was published. Here's what Newsweek wrote: "The paper was vague about where RaTG13 had come from. It didn't say exactly where or when RaTG13 had been found, just that it had previously been detected in a bat in Yunnan Province, in southern China.

The paper aroused Deigin's suspicions. He wondered if SARS-CoV-2 might have emerged through some genetic mixing and matching from a lab working with RaTG13 or related viruses. His post was cogent and comprehensive. The Seeker posted Deigin's theory on Reddit, which promptly suspended his account permanently."

CONCLUSION: This is written like it's a big mystery that was uncovered by some clever sleuthing. It's true that the origin of RaTG13 was not discussed in the SARS-CoV-2 paper in January 2020 other than to say that it was found in a bat in Yunnan. I assume that the authors didn't think it was important (and still don't). The origin was explained in November 2020 in an Addendum to the Nature article (Zhou et al., 2020, Addendum). It was one of the viruses discovered in the bats from the Yunnan mine cave and a partial sequence had been published earlier (Ge et al., 2016). It's not particularly close to SARS-CoV-2 and there's no reason to speculate that it was artificially created unless you are trying to create a conspiracy.

Newsweek statement #5: The key facts quickly came together. The genetic sequence for RaTG13 perfectly matched a small piece of genetic code posted as part of a paper written by Shi Zhengli years earlier, but never mentioned again. The code came from a virus the WIV had found in a Yunnan bat. Connecting key details in the two papers with old news stories, the DRASTIC team determined that RaTG13 had come from a mineshaft in Mojiang County, in Yunnan Province, where six men shoveling bat guano in 2012 had developed pneumonia. Three of them died. DRASTIC wondered if that event marked the first cases of human beings being infected with a precursor of SARS-CoV-2—perhaps RaTG13 or something like it.

In a profile in Scientific American, Shi Zhengli acknowledged working in a mineshaft in Mojiang County where miners had died. But she avoided connecting it to RaTG13 (an omission she had made in her scientific papers as well), claiming that a fungus in the cave had killed the miners.

This reads just like a typical conspiracy theory where "clever" sleuths (i.e. internet amateurs) uncover information that was hidden or covered up by those they are accusing. The origin of RaTG13 was explained in an addendum to the publication of the SARS-CoV-2 sequence in February 2020. The addendum was added in November 2020 in response to questions about the origin of RaTG13 but that information was widely known. The sequence of a short fragment of this virus was obtained earlier as explained above.

The WIV scientists were very concerned about the Yunnan mine workers because they had symptoms that were similar to those of SARS patients and that's why they tested serum from the patients. They were negative for all the viruses, including the original SARS-CoV-1. (The serum is also negative for SARS-CoV-2.) The WIV scientists were worried that the infections were due to an unknown virus that could cause a pandemic so they went back to the mine every year to collect samples from the bats. The RaTG13 sequence came from one of those samples but by then the scientists knew that there was no connection between the bat coronaviruses and the sick mine workers. (They were probably disappointed at the lack of connection because they were looking for the cause of the 2002 SARS outbreak.)

The WIV scientists now believe that the Yunnan mine workers had contracted a fungal infection from the fungus growing on the bat guano. There is no reason to connect RaTG13 to the mine workers because it's been known for many years that the workers were not infected with any coronavirus.

The RaTG13 virus is from the bat species Rhinolophus affinis (hence the designation "Ra") but up until the beginning of the pandemic the WIV scientists were much more interested in another cave in Yunnan populated by a number of different species. They reported that this cave represents the most diverse collection of bat coronaviruses in the world. Most of the ones that are SARS-like were from a different species of bat, Rhinolophus sinicus, and many of these bound the same ACE2 receptor that SARS-CoV-1 used—the same one used by the more recent SARS-CoV-2 (Hu et al. 2017; Cui et al., 2019).

CONCLUSION: The Newsweek article is repeating innuendos and conspiracies that have been discredited in the past. The DRASTIC team is deliberately making up connections between coronaviruses and the mine workers but all of the data shows that there's no direct connection. It just happened that one of the bat coronaviruses collected in that mine happened to be the one closest to SARS-CoV-2, in part because that was a pretty extensive collection. The RaTG13 sequence is not similar enough to SARS-CoV-2 to be the direct ancestor and, besides, there are now known to be other virus sequences from as far away as Cambodia that are just as similar to SARS-CoV-2.

Newsweek statement #6: That explanation didn't sit well with the DRASTIC group. They suspected a SARS-like virus, not a fungus, had killed the miners and that, for whatever reason, the WIV was trying to hide that fact. It was a hunch, and they had no way of proving it.

At this point, The Seeker revealed his research powers to the group. In his online explorations, he'd recently discovered a massive Chinese database of academic journals and theses called CNKI. Now he wondered if somewhere in its vast circuitry might be information on the sickened miners.

Working through the night at his bedside table on phone and laptop, fueled by chai and using Chinese characters with the help of Google Translate, he plugged in "Mojiang"—the county where the mine was located—in combination with every other word he could think of that might be relevant, instantly translating each new flush of results back to English. "Mojiang + pneumonia"; "Mojiang + WIV"; "Mojiang + bats"; "Mojiang + SARS." Each search brought back thousands of results and half a dozen different databases for journals, books, newspapers, master's theses, doctoral dissertations. He combed through these results, night after night, but never found anything useful. When he ran out of energy, he broke for arcade games and more chai.

He was on the verge of calling it quits, he says, when he struck gold: a 60-page master's thesis written by a student at Kunming Medical University in 2013 titled "The Analysis of 6 Patients with Severe Pneumonia Caused by Unknown Viruses." In exhaustive detail, it described the conditions and step-by-step treatment of the miners. It named the suspected culprit: "Caused by SARS-like [coronavirus] from the Chinese horseshoe bat or other bats."

CONCLUSION: Move along folks; there's nothing to see here. The WIV scientists suspected that the miners were infected with an unknown virus and that's why they were concerned in 2012. They knew that coronavirus wasn't responsible and neither was any other known virus. This is why they went back every year to test the bats in the mine shaft. They know that the stored serum from these workers is negative for SARS-CoV-2, which is not a surprise. They now suspect that the mine workers had contracted a fungal infection and not a viral infection. It's not particularly surprising that a student reported the suspected cause of the symptoms back in the beginning of the investigation.

Newsweek statement #7: Ribera was responsible for solving another piece of the RaTG13 puzzle. Had the WIV been actively working on RaTG13 during the seven years since they discovered it? Peter Daszak said no: they had never used the virus because it wasn't similar enough to the original SARS. "We thought it's interesting, but not high-risk," he told Wired. "So we didn't do anything about it and put it in the freezer."

Ribera disproved that account. When a new science paper on genetics is published, the authors must upload the accompanying genetic sequences to an international database. By examining some metadata tags that had been accidentally uploaded by the WIV along with its genetic sequences for RaTG13, Ribera discovered that scientists at the lab had indeed been actively studying the virus in 2017 and 2018—they hadn't stuck it in a freezer and forgotten about it, after all.

I don't know what this means. The WIV scientists sequenced a bit of what turned out to be the RaTG13 virus when they categorized all the other viruses back in 2012-2015 (Ge et al. 2016). They then completed an almost whole genome sequence later on in 2018 when their sequencing techniques improved. It's important to keep in mind that the WIV never worked with the RaTG13 virus as emphasized by Frutos et al. (2021): "One must remember that SARS-CoV-2 was never found in the wild and that RaTG13 does not exist as a real virus but instead only as a sequence in a computer. It is a virtual virus which thus cannot leak from a laboratory." 1

CONCLUSION: The scientists at WIV were "working with" the RaTG13 PCR fragments in 2017 and 2018 as they assembled the whole genome sequence. They also assembled the sequences of several other viruses at the same time. To say that they were "actively studying" the virus is very misleading and to accuse Peter Daszak of lying is irresponsible.

Newsweek statement #8: In fact, the WIV had been intensely interested in RaTG13 and everything else that had come from the Mojiang mineshaft. From his giant Sudoku puzzle, Ribera determined that they made at least seven different trips to the mine, over many years, collecting thousands of samples. Ribera's guess is that their technology had not been good enough in 2012 and 2013 to find the virus that had killed the miners, so they kept going back as the techniques improved.

He also made a bold prediction. Cross-referencing snippets of information from multiple sources, Ribera guessed, in a Twitter thread dated August 1, 2020, that a cluster of eight SARS-related viruses mentioned briefly in an obscure section of one WIV paper had actually also come from the Mojiang mine. In other words, they hadn't found one relative of SARS-CoV-2 in that mineshaft; they'd found nine. In November 2020, Shi Zhengli confirmed many of DRASTIC's suspicions about the Mojiang cave in an addendum to her original paper on RaTG13 and in a talk in February 2021.

The mine shaft is located in Mojiang county, Yunnan—a map of the location was published in Ge et al. (2016). It contains six different bat species and many of them were infected with coronaviruses. The WIV scientists collected many samples over a number of years in order to determine the phylogeny of the viruses and which species were infected. They also did longitudinal studies to see if the different virus variants changed over time and to see if the infection rates of the various bat species were different from year to year. They also wanted to see if they could detect recombinations between different virus groups.

They obtained 152 partial sequences and then picked 12 of them for more detailed analysis in order to construct a phylogenetic tree from 816 bp of the RNA-dependent RNA polymerase (RdRp) gene. Anyone can read the Ge et al. (2016) paper to see why they were doing these experiments. There's nothing mysterious or unusual about their approach. It's the same one they took with the viruses from the other site (cave) in Yunnan where they identified the two bat coronaviruses that are most closely related to the original SARS virus (Ge et al., 2013) (see: SARS outbreak linked to Chinese bat cave).

CONCLUSION: The Newsweek article is making a huge mountain out of a molehill and it's misrepresenting the work of the "amateur sleuths." It's not a secret or a mystery that the WIV scientists were studying the coronaviruses from the mine shaft. That's what they do and they publish in journals that are easy to access.

Newsweek statement #9: "Other databases yielded other clues. In the WIV's grant applications and awards, The Seeker found detailed descriptions of the Institute's research plans, and they were damning: Projects were underway to test the infectivity of novel SARS-like viruses they'd discovered in human cells and in lab animals, to see how they might mutate as they crossed species, and to genetically recombine pieces of different viruses—all being done at woefully inadequate biosecurity levels. All the elements for a disaster were on hand."

CONCLUSION: It's true that the WIV scientists were looking at SARS-like coronaviruses and they were testing for infectivity in humanized mouse cells. The goal was to look for new coronaviruses that could bind ACE2 and they found quite a few of them. In many cases, they expressed the spike protein in recombinant viruses and plasmids just as you would expect them to do if they were looking for the source of the original SARS virus (SARS-CoV-1). All this is described in their grant applications and in their publications. Looks like they didn't make much of an attempt to hide this research. All the experiments were done under the appropriate biosafety measures as specified by international inspectors who visited the lab on several occasions. None of this has anything to do with the pandemic because they were not working with SARS-CoV-2 or any close relative.

The rest of the Newsweek article consists mostly of praise for the DRASTIC heroes and the excellent work they have done in uncovering a huge conspiracy to cover up the fact that the WIV scientists started a pandemic. However, one embarrassing fact remains: there is not a shred of evidence that the lab was working with SARS-CoV-2 before the pandemic started. In the absence of such evidence it is irresponsible to accuse these reputable scientists of lying.


1. One could quibble slightly about the accuracy of this statement since there might be RaTG13 virus particles in the bat fecal samples that are stored in the -80°C freezer.

Cui, J., Li, F. and Shi, Z.-L. (2019) Origin and evolution of pathogenic coronaviruses. Nature Reviews Microbiology 17:181-192. [doi: 10.1038/s41579-018-0118-9]

Severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) are two highly transmissible and pathogenic viruses that emerged in humans at the beginning of the 21st century. Both viruses likely originated in bats, and genetically diverse coronaviruses that are related to SARS-CoV and MERS-CoV were discovered in bats worldwide. In this Review, we summarize the current knowledge on the origin and evolution of these two pathogenic coronaviruses and discuss their receptor usage; we also highlight the diversity and potential of spillover of bat-borne coronaviruses, as evidenced by the recent spillover of swine acute diarrhoea syndrome coronavirus (SADS-CoV) to pigs.

Hul, V., Delaune, D., Karlsson, E.A., Hassanin, A., Tey, P.O., Baidaliuk, A., Gámbaro, F., Tu, V.T., Keatts, L. and Mazet, J. (2021) A novel SARS-CoV-2 related coronavirus in bats from Cambodia. bioRxiv. [doi: 10.1101/2021.01.26.428212]

Knowledge of the origin and reservoir of the coronavirus responsible for the ongoing COVID-19 pandemic is still fragmentary. To date, the closest relatives to SARS-CoV-2 have been detected in Rhinolophus bats sampled in the Yunnan province, China. Here we describe the identification of SARS-CoV-2 related coronaviruses in two Rhinolophus shameli bats sampled in Cambodia in 2010. Metagenomic sequencing identified nearly identical viruses sharing 92.6% nucleotide identity with SARS-CoV-2. Most genomic regions are closely related to SARS-CoV-2, with the exception of a small region corresponding to the spike N terminal domain. The discovery of these viruses in a bat species not found in China indicates that SARS-CoV-2 related viruses have a much wider geographic distribution than previously understood, and suggests that Southeast Asia represents a key area to consider in the ongoing search for the origins of SARS-CoV-2, and in future surveillance for coronaviruses.

Ge, X.-Y., Li, J.-L., Yang, X.-L., Chmura, A.A., Zhu, G., Epstein, J.H., Mazet, J.K., Hu, B., Zhang, W. and Peng, C. (2013) Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor. Nature 503:535-538. [doi: 10.1038/nature12711]

The 2002–3 pandemic caused by severe acute respiratory syndrome coronavirus (SARS-CoV) was one of the most significant public health events in recent history. An ongoing outbreak of Middle East respiratory syndrome coronavirus suggests that this group of viruses remains a key threat and that their distribution is wider than previously recognized. Although bats have been suggested to be the natural reservoirs of both viruses, attempts to isolate the progenitor virus of SARS-CoV from bats have been unsuccessful. Diverse SARS-like coronaviruses (SL-CoVs) have now been reported from bats in China, Europe and Africa, but none is considered a direct progenitor of SARS-CoV because of their phylogenetic disparity from this virus and the inability of their spike proteins to use the SARS-CoV cellular receptor molecule, the human angiotensin converting enzyme II (ACE2). Here we report whole-genome sequences of two novel bat coronaviruses from Chinese horseshoe bats (family: Rhinolophidae) in Yunnan, China: RsSHC014 and Rs3367. These viruses are far more closely related to SARS-CoV than any previously identified bat coronaviruses, particularly in the receptor binding domain of the spike protein. Most importantly, we report the first recorded isolation of a live SL-CoV (bat SL-CoV-WIV1) from bat faecal samples in Vero E6 cells, which has typical coronavirus morphology, 99.9% sequence identity to Rs3367 and uses ACE2 from humans, civets and Chinese horseshoe bats for cell entry. Preliminary in vitro testing indicates that WIV1 also has a broad species tropism. Our results provide the strongest evidence to date that Chinese horseshoe bats are natural reservoirs of SARS-CoV, and that intermediate hosts may not be necessary for direct human infection by some bat SL-CoVs. They also highlight the importance of pathogen-discovery programs targeting high-risk wildlife groups in emerging disease hotspots as a strategy for pandemic preparedness.

Ge, X.-Y., Wang, N., Zhang, W., Hu, B., Li, B., Zhang, Y.-Z., Zhou, J.-H., Luo, C.-M., Yang, X.-L. and Wu, L.-J. (2016) Coexistence of multiple coronaviruses in several bat colonies in an abandoned mineshaft. Virologica Sinica 31:31-40. [doi: 10.1007/s12250-016-3713-9]

Since the 2002–2003 severe acute respiratory syndrome (SARS) outbreak prompted a search for the natural reservoir of the SARS coronavirus, numerous alpha- and betacoronaviruses have been discovered in bats around the world. Bats are likely the natural reservoir of alpha- and beta-coronaviruses, and due to the rich diversity and global distribution of bats, the number of bat coronaviruses will likely increase. We conducted a surveillance of coronaviruses in bats in an abandoned mineshaft in Mojiang County, Yunnan Province, China, from 2012–2013. Six bat species were frequently detected in the cave: Rhinolophus sinicus, Rhinolophus affinis, Hipposideros pomona, Miniopterus schreibersii, Miniopterus fuliginosus, and Miniopterus fuscus. By sequencing PCR products of the coronavirus RNA-dependent RNA polymerase gene (RdRp), we found a high frequency of infection by a diverse group of coronaviruses in different bat species in the mineshaft. Sequenced partial RdRp fragments had 80%–99% nucleic acid sequence identity with well-characterized Alphacoronavirus species, including BtCoV HKU2, BtCoV HKU8, and BtCoV1, and unassigned species BtCoV HKU7 and BtCoV HKU10. Additionally, the surveillance identified two unclassified betacoronaviruses, one new strain of SARS-like coronavirus, and one potentially new betacoronavirus species. Furthermore, coronavirus co-infection was detected in all six bat species, a phenomenon that fosters recombination and promotes the emergence of novel virus strains. Our findings highlight the importance of bats as natural reservoirs of coronaviruses and the potentially zoonotic source of viral pathogens.

Guo, H., Hu, B., Si, H.-r., Zhu, Y., Zhang, W., Li, B., Li, A., Geng, R., Lin, H.-F. and Yang, X.-L. (2021) Identification of a novel lineage bat SARS-related coronaviruses that use bat ACE2 receptor. bioRxiv. [doi: 10.1101/2021.05.21.445091]

Severe respiratory disease coronavirus-2 (SARS-CoV-2) causes the most devastating disease, COVID-19, of the recent century. One of the unsolved scientific questions around SARS-CoV-2 is the animal origin of this virus. Bats and pangolins are recognized as the most probable reservoir hosts that harbor the highly similar SARS-CoV-2 related viruses (SARSr-CoV-2). Here, we report the identification of a novel lineage of SARSr-CoVs, including RaTG15 and seven other viruses, from bats at the same location where we found RaTG13 in 2015. Although RaTG15 and the related viruses share 97.2% amino acid sequence identities to SARS-CoV-2 in the conserved ORF1b region, but only show less than 77.6% to all known SARSr-CoVs in genome level, thus forms a distinct lineage in the Sarbecovirus phylogenetic tree. We then found that RaTG15 receptor binding domain (RBD) can bind to and use Rhinolophus affinis bat ACE2 (RaACE2) but not human ACE2 as entry receptor, although which contains a short deletion and has different key residues responsible for ACE2 binding. In addition, we show that none of the known viruses in bat SARSr-CoV-2 lineage or the novel lineage discovered so far use human ACE2 efficiently compared to SARSr-CoV-2 from pangolin or some of the SARSr-CoV-1 lineage viruses. Collectively, we suggest more systematic and longitudinal work in bats to prevent future spillover events caused by SARSr-CoVs or to better understand the origin of SARS-CoV-2.

MacLean, O.A., Lytras, S., Weaver, S., Singer, J.B., Boni, M.F., Lemey, P., Pond, S.L.K. and Robertson, D.L. (2021) Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen. PLoS Biology 19:e3001115. [doi: 10.1371/journal.pbio.3001115]

Virus host shifts are generally associated with novel adaptations to exploit the cells of the new host species optimally. Surprisingly, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has apparently required little to no significant adaptation to humans since the start of the Coronavirus Disease 2019 (COVID-19) pandemic and to October 2020. Here we assess the types of natural selection taking place in Sarbecoviruses in horseshoe bats versus the early SARS-CoV-2 evolution in humans. While there is moderate evidence of diversifying positive selection in SARS-CoV-2 in humans, it is limited to the early phase of the pandemic, and purifying selection is much weaker in SARS-CoV-2 than in related bat Sarbecoviruses. In contrast, our analysis detects evidence for significant positive episodic diversifying selection acting at the base of the bat virus lineage SARS-CoV-2 emerged from, accompanied by an adaptive depletion in CpG composition presumed to be linked to the action of antiviral mechanisms in these ancestral bat hosts. The closest bat virus to SARS-CoV-2, RmYN02 (sharing an ancestor about 1976), is a recombinant with a structure that includes differential CpG content in Spike; clear evidence of coinfection and evolution in bats without involvement of other species. While an undiscovered “facilitating” intermediate species cannot be discounted, collectively, our results support the progenitor of SARS-CoV-2 being capable of efficient human–human transmission as a consequence of its adaptive evolutionary history in bats, not humans, which created a relatively generalist virus.

Wacharapluesadee, S., Tan, C.W., Maneeorn, P., Duengkae, P., Zhu, F., Joyjinda, Y., Kaewpom, T., Chia, W.N., Ampoot, W. and Lim, B.L. (2021) Evidence for SARS-CoV-2 related coronaviruses circulating in bats and pangolins in Southeast Asia. Nature Communications 12:1-9. [doi: 10.1038/s41467-021-21240-1]

Among the many questions unanswered for the COVID-19 pandemic are the origin of SARS-CoV-2 and the potential role of intermediate animal host(s) in the early animal-to-human transmission. The discovery of RaTG13 bat coronavirus in China suggested a high probability of a bat origin. Here we report molecular and serological evidence of SARS-CoV-2 related coronaviruses (SC2r-CoVs) actively circulating in bats in Southeast Asia. Whole genome sequences were obtained from five independent bats (Rhinolophus acuminatus) in a Thai cave yielding a single isolate (named RacCS203) which is most related to the RmYN02 isolate found in Rhinolophus malayanus in Yunnan, China. SARS-CoV-2 neutralizing antibodies were also detected in bats of the same colony and in a pangolin at a wildlife checkpoint in Southern Thailand. Antisera raised against the receptor binding domain (RBD) of RmYN02 was able to cross-neutralize SARS-CoV-2 despite the fact that the RBD of RacCS203 or RmYN02 failed to bind ACE2. Although the origin of the virus remains unresolved, our study extended the geographic distribution of genetically diverse SC2r-CoVs from Japan and China to Thailand over a 4800-km range. Cross-border surveillance is urgently needed to find the immediate progenitor virus of SARS-CoV-2.

Zhou, P., Yang, X.-L., Wang, X.-G., Hu, B., Zhang, L., Zhang, W., Si, H.-R., Zhu, Y., Li, B. and Huang, C.-L. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579:270-273. [doi: 10.1038/s41586-020-2012-7]

Since the outbreak of severe acute respiratory syndrome (SARS) 18 years ago, a large number of SARS-related coronaviruses (SARSr-CoVs) have been discovered in their natural reservoir host, bats. Previous studies have shown that some bat SARSr-CoVs have the potential to infect humans. Here we report the identification and characterization of a new coronavirus (2019-nCoV), which caused an epidemic of acute respiratory syndrome in humans in Wuhan, China. The epidemic, which started on 12 December 2019, had caused 2,794 laboratory-confirmed infections including 80 deaths by 26 January 2020. Full-length genome sequences were obtained from five patients at an early stage of the outbreak. The sequences are almost identical and share 79.6% sequence identity to SARS-CoV. Furthermore, we show that 2019-nCoV is 96% identical at the whole-genome level to a bat coronavirus. Pairwise protein sequence analysis of seven conserved non-structural proteins domains show that this virus belongs to the species of SARSr-CoV. In addition, 2019-nCoV virus isolated from the bronchoalveolar lavage fluid of a critically ill patient could be neutralized by sera from several patients. Notably, we confirmed that 2019-nCoV uses the same cell entry receptor—angiotensin converting enzyme II (ACE2)—as SARS-CoV.

Zhou, P. et al. (2020) Addendum: A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 588:E6-E6. [doi: 10.1038/s41586-020-2951-z]

Zhou, H., Ji, J., Chen, X., Bi, Y., Li, J., Hu, T., Song, H., Chen, Y., Cui, M. and Zhang, Y. (2021) Identification of novel bat coronaviruses sheds light on the evolutionary origins of SARS-CoV-2 and related viruses. bioRxiv. [doi: 10.1101/2021.03.08.434390]

Although a variety of SARS-CoV-2 related coronaviruses have been identified, the evolutionary origins of this virus remain elusive. We describe a meta-transcriptomic study of 411 samples collected from 23 bat species in a small (~1100 hectare) region in Yunnan province, China, from May 2019 to November 2020. We identified coronavirus contigs in 40 of 100 sequencing libraries, including seven representing SARS-CoV-2-like contigs. From these data we obtained 24 full-length coronavirus genomes, including four novel SARS-CoV-2 related and three SARS-CoV related genomes. Of these viruses, RpYN06 exhibited 94.5% sequence identity to SARS-CoV-2 across the whole genome and was the closest relative of SARS-CoV-2 in the ORF1ab, ORF7a, ORF8, N, and ORF10 genes. The other three SARS-CoV-2 related coronaviruses were nearly identical in sequence and clustered closely with a virus previously identified in pangolins from Guangxi, China, although with a genetically distinct spike gene sequence. We also identified 17 alphacoronavirus genomes, including those closely related to swine acute diarrhea syndrome virus and porcine epidemic diarrhea virus. Ecological modeling predicted the co-existence of up to 23 Rhinolophus bat species in Southeast Asia and southern China, with the largest contiguous hotspots extending from South Lao and Vietnam to southern China. Our study highlights both the remarkable diversity of bat viruses at the local scale and that relatives of SARS-CoV-2 and SARS-CoV circulate in wildlife species in a broad geographic region of Southeast Asia and southern China. These data will help guide surveillance efforts to determine the origins of SARS-CoV-2 and other pathogenic coronaviruses.

Wednesday, July 31, 2013

The Dark Matter Rises

John Mattick is a Professor and research scientist at the Garvan Institute of Medical Research at the University of New South Wales (Australia).

John Mattick publishes lots of papers. Most of them are directed toward proving that almost all of the human genome is functional. I want to remind you of some of the things that John Mattick has said in the past so you'll be prepared to appreciate my next post [The Junk DNA Controversy: John Mattick Defends Design].

Mattick believes that the Central Dogma means DNA makes RNA makes protein. He believes that scientists in the past took this very literally and discounted the importance of RNA. According to Mattick, scientists in the past believed that genes were the only functional part of the genome and that all genes encoded proteins.

If that sounds familiar it's because there are many IDiots who make the same false claim. Like Mattick, they don't understand the Central Dogma of Molecular Biology and they don't understand the history that they are distorting.

Mattick believes that there is a correlation between the amount of noncoding DNA in a genome and the complexity of the organism. He thinks that the noncoding DNA is responsible for making tons of regulatory RNAs and for regulating expression of the genes. This belief led him to publish a famous figure (left) in Scientific American.

Mattick has many followers. So many, in fact, that the Human Genome Organization (HUGO) recently gave him an award for his contributions to the study of the human genome. Here's the citation.
The Award Reviewing Committee commented that Professor Mattick’s “work on long non-coding RNA has dramatically changed our concept of 95% of our genome”, and that he has been a “true visionary in his field; he has demonstrated an extraordinary degree of perseverance and ingenuity in gradually proving his hypothesis over the course of 18 years.”
Let's see what this "true visionary" is saying this year. The first paper is "The dark matter rises: the expanding world of regulatory RNAs" (Clark et al., 2013). Here's the abstract ...
The ability to sequence genomes and characterize their products has begun to reveal the central role for regulatory RNAs in biology, especially in complex organisms. It is now evident that the human genome contains not only protein-coding genes, but also tens of thousands of non–protein coding genes that express small and long ncRNAs (non-coding RNAs). Rapid progress in characterizing these ncRNAs has identified a diverse range of subclasses, which vary widely in size, sequence and mechanism-of-action, but share a common functional theme of regulating gene expression. ncRNAs play a crucial role in many cellular pathways, including the differentiation and development of cells and organs and, when mis-regulated, in a number of diseases. Increasing evidence suggests that these RNAs are a major area of evolutionary innovation and play an important role in determining phenotypic diversity in animals.
This is his main theme. Mattick believes that a large percentage of the human genome is devoted to making regulatory RNAs that control development. He believes that the evolution of this complex regulatory network is responsible for the creation of complex organisms like humans, which, incidentally, are the pinnacle of evolution according to the figure shown above.

The second paper I want to highlight focuses on a slightly different theme. Its title is "Understanding the regulatory and transcriptional complexity of the genome through structure" (Mercer and Mattick, 2013). In this paper he emphasizes the role of noncoding DNA in creating a complicated three-dimensional chromatin structure within the nucleus. This structure is important in regulating gene expression in complex organisms. Here's the abstract ...
An expansive functionality and complexity has been ascribed to the majority of the human genome that was unanticipated at the outset of the draft sequence and assembly a decade ago. We are now faced with the challenge of integrating and interpreting this complexity in order to achieve a coherent view of genome biology. We argue that the linear representation of the genome exacerbates this complexity and an understanding of its three-dimensional structure is central to interpreting the regulatory and transcriptional architecture of the genome. Chromatin conformation capture techniques and high-resolution microscopy have afforded an emergent global view of genome structure within the nucleus. Chromosomes fold into complex, territorialized three-dimensional domains in concert with specialized subnuclear bodies that harbor concentrations of transcription and splicing machinery. The signature of these folds is retained within the layered regulatory landscapes annotated by chromatin immunoprecipitation, and we propose that genome contacts are reflected in the organization and expression of interweaved networks of overlapping coding and noncoding transcripts. This pervasive impact of genome structure favors a preeminent role for the nucleoskeleton and RNA in regulating gene expression by organizing these folds and contacts. Accordingly, we propose that the local and global three-dimensional structure of the genome provides a consistent, integrated, and intuitive framework for interpreting and understanding the regulatory and transcriptional complexity of the human genome.
Other posts about John Mattick.

How Not to Do Science
John Mattick on the Importance of Non-coding RNA
John Mattick Wins Chen Award for Distinguished Academic Achievement in Human Genetic and Genomic Research
International team cracks mammalian gene control code
Greg Laden Gets Suckered by John Mattick
How Much Junk in the Human Genome?
Genome Size, Complexity, and the C-Value Paradox


Clark, M.B., Choudhary, A., Smith, M.A., Taft, R.J. and Mattick, J.S. (2013) The dark matter rises: the expanding world of regulatory RNAs. Essays in Biochemistry 54:1-16. [doi:10.1042/bse0540001]

Mercer, T.R. and Mattick, J.S. (2013) Understanding the regulatory and transcriptional complexity of the genome through structure. Genome Research 23:1081-1088. [doi: 10.1101/gr.156612.113]