Tuesday, October 14, 2008

Bacteria Phylogeny: Facing Up to the Problems

There are millions of species of bacteria. Sorting out their evolutionary history has been a major challenge for decades. Unlike the much bigger multicellular eukaryotes, bacteria have few morphological markers to assist scientists in classifying them. The fossil record is mostly silent.

Molecular evolution came to the rescue thirty years ago when cloning and sequencing became common. Soon there were elaborate and detailed phylogenetic trees based on comparing sequences of conserved genes from many species.

The gene of choice was the one for the small subunit ribosomal RNA (SSU rRNA). This gene was well conserved in bacteria and it was easy to get sequences simply by PCR. (The ends of the SSU rRNA gene are conserved and this means that you can develop universal primers for PCR.)

Over the years, the SSU rRNA gene has become what is called the "gold standard" in bacterial phylogeny and taxonomy. Many species have been assigned to taxa based entirely on the sequence of their SSU rRNA gene. Unfortunately, the "gold standard" has become somewhat tarnished lately.

Our fellow blogger, Jonathan Eisen of The Tree of Life, has recently published a paper that looks at the problems with bacterial phylogeny (Wu and Eisen, 2008). He posted a brief summary of the paper and commented on why he likes the journal Genome Biology [Happy Open Access Day: Back to Genome Biology for Me].

There is much to like about this paper. The authors face up to the problems with the current bacterial phylogeny, which is based almost entirely on a single gene (SSU rRNA). They point out that this is risky given what we know about molecular phylogenies. Furthermore, in the case of the SSU ribosomal RNA gene we know for a fact that this has led to problems and inconsistencies. In addition to the practical difficulties there are good theoretical reasons for being suspicious of phylogenies constructed from nucleotide sequences.

What to do? One possible solution is to abandon SSU rRNA as a "gold standard" and replace it with a highly conserved protein coding gene. Unfortunately, this doesn't get around the problem of relying on a single gene. The way around this is to use an artificial concatenated sequence made up of the protein sequences of several different conserved genes, laid out end-to-end in one large string of amino acids.
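As a rough illustration of what "laid out end-to-end" means in practice, here is a minimal Python sketch that joins per-gene protein alignments into one long sequence per species. The function name, the gap-padding rule for species that lack a gene, and the toy sequences are all assumptions made for this example; nothing here is taken from Wu and Eisen's software.

```python
def concatenate_alignments(gene_alignments, gap="-"):
    """Join per-gene alignments end-to-end into one sequence per species.

    gene_alignments: list of dicts mapping species name -> aligned sequence.
    A species missing from a gene is padded with gaps for that gene's width
    (an assumption of this sketch, not a rule from the paper).
    """
    species = sorted({name for aln in gene_alignments for name in aln})
    combined = {name: [] for name in species}
    for aln in gene_alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("sequences in one gene alignment must have equal length")
        width = widths.pop()
        for name in species:
            combined[name].append(aln.get(name, gap * width))
    return {name: "".join(parts) for name, parts in combined.items()}


# Toy data: two made-up protein alignments for three species.
rpoB = {"E_coli": "MVYSY", "B_subtilis": "MLYSF", "T_thermophilus": "M-YSY"}
tsf = {"E_coli": "AEIT", "B_subtilis": "AEVT"}  # absent from one species

supermatrix = concatenate_alignments([rpoB, tsf])
```

Every species ends up with a sequence of the same total length, which is what a tree-building program expects from a single alignment.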

So why isn't this done? Because, as Wu and Eisen point out, it ain't that easy. The main difficulty in any phylogenetic study is getting a proper alignment. This is a problem that many workers simply ignore when they use automated alignment software like CLUSTALW. These workers assume that the alignments are valid.

They aren't, and this is another example of facing up to the problem. Many scientists agonize over what program to use when constructing their trees—should they use maximum likelihood, parsimony, etc. etc.? In most cases these decisions are a complete waste of time because their alignments aren't good enough to make a difference.

Here's how Wu and Eisen explain it ...

"It has been shown that alignment quality can have greater impact on the final tree than does the tree-building method employed [20]. Therefore, preparing high quality sequence alignments is a most critical part of any molecular phylogenetic analysis. This preparation typically involves careful but tedious manual editing and trimming of the generated alignments, and thus remains the biggest challenge to automation. When scaling up this process, the trimming step is often simply ignored. Automated trimming based on the number of gaps in each column or each column's conservation score can be used to select conserved blocks, but still is not satisfactory when a high quality tree is required."

Keep in mind that what is being proposed is a large tree based on concatenated sequences from many genes. You don't want to do multiple sequence alignments for every gene by hand, and yet up until now, that was the only way to get accurate results.
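For readers who haven't seen it, the simple gap-count trimming mentioned in the quoted passage looks something like the Python sketch below. The 50% threshold, the function name, and the toy alignment are assumptions chosen for illustration. As the quote says, this kind of filter is not considered good enough when a high quality tree is required; it only shows the idea.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Drop alignment columns whose fraction of gaps exceeds the threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns the trimmed alignment and the indices of the columns that were kept.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    n_rows = len(rows)
    keep = [
        i for i, column in enumerate(zip(*rows))
        if column.count(gap) / n_rows <= max_gap_fraction
    ]
    trimmed = {name: "".join(alignment[name][i] for i in keep) for name in names}
    return trimmed, keep


# Toy alignment: the third column is a gap in three of the four sequences.
toy = {"a": "MK-LV", "b": "M--LV", "c": "MKALI", "d": "M--LV"}
trimmed, kept_columns = trim_gappy_columns(toy)
```

With the default threshold, the third column is removed (three gaps out of four) while the second is kept (exactly half gaps).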

Wu and Eisen have written a program called AMPHORA that hopefully solves this problem. They begin by creating manually curated "seed alignments." Then they use AMPHORA to align all the other sequences to the seed alignments. In this way they hope to overcome the limitations of automated multisequence alignment without having to align everything by hand.
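To make the idea of "aligning to a seed" concrete, here is a toy Python sketch. It is not AMPHORA's method, which this post does not describe in detail. As a stand-in, it takes the majority consensus of a small seed alignment and fits a new sequence to it with a plain Needleman-Wunsch global alignment, using scoring values picked arbitrarily for the example. The key property it illustrates is that the seed's columns stay fixed: the new sequence gets gaps where it lacks a residue, and any residues that fall between seed columns are dropped.

```python
from collections import Counter


def consensus(seed_rows, gap="-"):
    """Most common non-gap residue in each column of the seed alignment."""
    cons = []
    for column in zip(*seed_rows):
        residues = [c for c in column if c != gap]
        cons.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(cons)


def fit_to_seed(query, seed_rows, match=1, mismatch=-1, indel=-1, gap="-"):
    """Place `query` into the seed's columns; output length equals seed width."""
    ref = consensus(seed_rows, gap)
    n, m = len(query), len(ref)

    def sub(i, j):
        return match if query[i - 1] == ref[j - 1] else mismatch

    # score[i][j]: best score for aligning query[:i] with ref[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + indel,
                score[i][j - 1] + indel,
            )

    # Trace back, emitting exactly one character per seed column.
    out = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out.append(query[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and score[i][j] == score[i][j - 1] + indel:
            out.append(gap)  # seed column with no query residue
            j -= 1
        else:
            i -= 1  # query residue with no seed column: dropped
    return "".join(reversed(out))


# Made-up seed alignment and two new sequences.
seed = ["MKTLV", "MKSLV", "MRTLI"]
shorter = fit_to_seed("MKLV", seed)    # lacks one residue
longer = fit_to_seed("MKTQLV", seed)   # has one extra residue
```

Because every new sequence comes out with the same width as the seed, the curated columns never have to be re-examined by hand as more genomes are added.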

None of this would be possible, of course, unless there were large numbers of species where every one of the target genes has been cloned and sequenced. In the 20th century this would have been impossible but now there are hundreds of completely sequenced bacterial genomes. This means that each one of them has a sequenced copy of the genes required for this kind of analysis.

All that's left is to identify the completely sequenced genomes and pick the set of genes. There are 578 genomes in the database but many of these are close relatives that will not be useful in constructing a large tree of all bacterial sequences. The final set contains 310 genomes with representatives of all the major groups.

The authors selected 31 genes for their initial proof of principle paper (dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, tsf). Those of you who recognize these genes will see that 22 of them (the rpl, rpm, and rps genes) encode small ribosomal proteins. This was not the best choice, in my opinion, but the authors of the paper note that they are continuing the study by incorporating better genes such as HSP70 (dnaK) and EF-Tu (tufA). You can't just choose any conserved gene because it has to be present in most species and there are surprisingly few genes that meet that criterion.

After all that, what's the bottom line? The grand phylogeny is shown at the top of this posting. It resolves many groups that are unresolvable using the SSU rRNA tree. In some cases this tree reveals species that have been incorrectly assigned to higher taxa. These species will have to be reclassified if this result holds up.

The most important finding is that the method works and it yields trees with excellent resolution of the major bacterial taxa.

Wu, Martin, and Eisen, Jonathan (2008). A simple, fast, and accurate method of phylogenomic inference. Genome Biology 9:R151 [Genome Biology] [doi:10.1186/gb-2008-9-10-r151]


Christopher Taylor said...

Very nice (the DOI link to the paper doesn't work at the moment, by the way, but the other one does). The main problem that I can imagine with using concatenated sequences for prokaryotes (without my being an expert, of course) would be the Horizontal Gene Transfer issue - if different gene sections are giving you different but both perfectly accurate estimates of phylogeny, what effect will that have on your result? I've wondered if prokaryotes might be one group of organisms where a supertree approach might be more effective - making separate trees from separate genes, then combining them for your end phylogeny.

Sage said...

Thirty years ago Woese was proposing to spin off the Archaebacteria, but he was not sequencing rRNA genes. He was isolating the rRNA itself and using RNA fingerprinting techniques based on Sanger's 1965 2-D electrophoresis method to compile the oligonucleotides produced from RNase digestion. He was measuring homologies and sequence differences, but not actually producing any full rRNA sequences.

PCR, nominally invented in 1984, doesn't open the doors to easy sequencing until after about 1989.

Allen MacNeill said...

What is the point of calling the terminal taxa in prokaryotic phylogenies "species", beyond a wholly spurious analogy with eukaryotes? None of the definitions of "species" that are generally applied to eukaryotes can be applied to prokaryotes for the simple reason that the latter do not couple reproduction with sexual recombination. The Dobzhansky-Mayr "biological species concept" is (like Darwin's) based on reproductive isolation; to be members of the same species, two individuals must be capable of interbreeding and producing fertile offspring under "natural" conditions. Prokaryotes simply don't do this at all.

Lynn Margulis (taking her cue from Sorin Sonea and Lucien Mathieu, of the Université de Montréal) asserts that

"...bacteria do not have species at all (or, which amounts to the same thing, all of them together constitute one single cosmopolitan species). Speciation is a property only of nucleated organisms."

For more on the whole problem of "species" in evolutionary biology, see:

Anonymous said...

There is no problem of "species" in evolutionary biology. None. There are a lot of confused people who think that "species" has to be something definable in absolute terms and applicable in 100% of cases. That obviously isn't always possible, because Nature is more complex than we wish it to be. ANY classification is an oversimplification of reality.

And so yes, taxonomy in prokaryotes is much more difficult but is not fundamentally different from eukaryotes. We still need to call different bacteria different names, right? So why not "species"? There would have to be *some* name anyway. If species don't exist in prokaryotes then what does exist? - Some different *word*, obviously. But it's just a word.

Anonymous said...

You could also check out this recent paper on defining species in strictly asexual rotifers for a way to treat speciation where recombination and reproduction have been decoupled.

Larry Moran said...

Allen MacNeill,

What is the point of calling the terminal taxa in prokaryotic phylogenies "species", beyond a wholly spurious analogy with eukaryotes?

You have to call them something and "species" is as good as anything on a blog such as this.

I'm well aware of the problems with defining "species."

You must have a better word or you wouldn't have raised the issue. Perhaps you could share it with us?

BTW, this part of your comment is dead wrong.

The Dobzhansky-Mayr "biological species concept" is (like Darwin's) based on reproductive isolation; to be members of the same species, two individuals must be capable of interbreeding and producing fertile offspring under "natural" conditions. Prokaryotes simply don't do this at all.

Allen MacNeill said...

How, precisely, is the statement "dead wrong"? Seems to me there are several possibilities:

1) That's not what Darwin, Dobzhansky, and Mayr said

2) The "biological species concept" isn't based on reproductive isolation

3) Reproduction in prokaryotes is indeed coupled with sexual recombination in essentially the same way it is in eukaryotes (especially animals)

4) Prokaryotes can be "fertile" or "infertile" in the same way that eukaryotes can

I'm curious; which of these assertions do you agree with, and on what evidence?

Allen MacNeill said...

The reason I don't like to use the term "species" to refer to the termini of prokaryote phylogenies is that it conveys all the wrong ideas about what phylogenies are all about. Focusing on reproductive incompatibility has little or no bearing on diversification among prokaryotes, and is hopelessly muddled by the problem of horizontal gene transfer. Indeed, I would go further: I think the whole idea of "species" is a holdover from Platonic typological thinking, and has done little or nothing to advance our understanding of how genetic and phenotypic diversification has proceeded in most phylogenies, including animals (the only group in which reproductive incompatibility plays a crucial role in phylogenetic diversification).

"Species", in other words, are a figment of the human imagination.

Anonymous said...

Allen MacNeill said:
"Species", in other words, are a figment of the human imagination.

And so are colors. The whole idea of color "green" is a holdover from Platonic typological thinking and has done little or nothing to advance our understanding of how electromagnetic waves are perceived by photoreceptors.

Strangely enough, I still find it useful. :-)

Larry Moran said...

Allen MacNeill asks,

How, precisely, is the statement "dead wrong"? Seems to me there are several possibilities:

... to be members of the same species, two individuals must be capable of interbreeding and producing fertile offspring under "natural" conditions. Prokaryotes simply don't do this at all.

This part is dead wrong. Many bacteria "species" have something akin to sex where individuals can exchange alleles,

Jonathan Eisen said...

Well #1 thanks for writing about our paper

#2 ... I just want to chime in on the species issue. I think there is no reason why we cannot use the term species for bacterial groups. Sure, they are not quite the same groupings as we would see in eukaryotic species, but there really do seem to be true groupings where on average many of the genes in the genome are more similar/related among members of a group than with other groups. To me, that is good enough to call things a species as it indicates some higher rate of gene flow within the group than between. Sure, lateral transfer messes things up, but it seems to not have eliminated consistent groupings (whether you believe Ford Doolittle's argument about what this means or not).

Jonathan Eisen said...

Oh, and I note, for the Venter Sargasso Sea paper, I used HSP70, EF-TU, EFG, RpoB and RecA as our protein markers, so maybe you would like those more?

Anonymous said...

It's wonderful for a layperson like me to be able to read about research like this.

I've looked at the figure reproduced in the post in the PDF version at Genome Biology, at a high enough zoom level for my middle-aged eyes to read text. What I see there leads me to the following question: Do the results tend to show bacteria such as Thermus thermophilus and Deinococcus geothermalis at the base of the phylogenetic tree? If so, this may be something well-understood in the field, but I hadn't known it and find it interesting.

Anonymous said...

Hi Jud, the tree in the figure is an unrooted tree so it does not really say anything about the base

Anonymous said...

anonymous wrote: [T]he tree in the figure is an unrooted tree so it does not really say anything about the base.

Yup, sorry, my fault for incorrectly phrasing the question. An attempt to do better: Does it appear possible or likely that bacteria such as Thermus thermophilus and Deinococcus geothermalis were ancestral to the other bacteria species shown in the figure?

Christopher Taylor said...

Jud - as pointed out, the tree as presented is not meant to be read as rooted, so the base of the tree could just as readily be between Gammaproteobacteria and all other lineages as between Deinococci and other lineages as shown. That said, the arrangement shown has obviously been chosen for its similarity to the arrangement found in many rDNA trees, which find Aquificae, Thermotogae and Deinococci quite low on the tree. Whether or not this arrangement represents the actual evolutionary history is still very debatable.

Ultrastructurally, eubacteria can be divided into two broad groups. Monodermata include mostly Gram-positive bacteria with a single cell membrane inside the cell wall, and would include Firmicutes, Actinobacteria and Thermotogae. Didermata are mostly Gram-negative bacteria with two cell membranes, one on either side of the cell wall, and would include everything to the right of Firmicutes in Wu and Eisen's tree, as well as Cyanobacteria, Aquificae and Deinococci. Either one of these two groups could be paraphyletic with regard to each other. The separation of Deinococci and Aquificae from other didermates might indicate a long-branch effect with those two groups appearing in the wrong part of the tree. Alternatively, Monodermata could have arisen polyphyletically through multiple losses of the outer membrane, or Didermata could be polyphyletic with multiple monodermate ancestors developing an outer membrane.

I suppose my central point is that the higher-level bacterial phylogeny is far from settled (and if researchers like Doolittle are correct, may not even be identifiable if time and LGT have eroded all the reliable signals).

Joanna said...

This is a very helpful approach for metagenomic and phylogenetic research. I am a master's student in Greece and I would die to try the AMPHORA pipeline in my master's thesis and in an article we are preparing here in our lab at the Biomedical Research Foundation (BRF) of the Academy of Athens. Unfortunately I cannot make it work, even after a lot of effort from friends and myself. The main problem is that I am not sure about the format of the input file I use (a txt file in FASTA format of 104 bacterial proteomes). But even when I try to run AMPHORA with a reference file, the application runs but it stops without printing or saving the tree. Please help!