Sandwalk: The Importance of Sequence Alignments

Thursday, September 05, 2013

The Importance of Sequence Alignments

There are several required steps in constructing phylogenetic trees from sequence data. The first step is to align the sequences so you can make direct comparisons. It used to be the case that multiple sequence alignments had to be checked manually because none of the available computer programs were as good as an experienced scientist. That hasn't changed. What's changed is that the data sets have become so large and complicated that nobody wants to even look at the sequence alignments to see if they can be improved.

Drew et al. (2013) suggest that sequence alignments should be made available.

Until recently, uploading sequences to GenBank (or EMBL) was generally considered sufficient to ensure reproducibility of phylogenetic studies using DNA sequence data. Increasingly, however, the systematics community is realizing that archiving raw DNA sequences is not adequate, and that the underlying alignments of DNA sequences as well as the resulting phylogenetic trees are pivotal for reproducibility, comparative purposes, meta-analyses, and ultimately synthesis. Indeed, there has been a growing clamor for journals to adopt and enforce more rigorous data archiving practices across diverse disciplines [4]–[8]. As a result, about 35 evolutionary journals [5],[9] have adopted policies to encourage or require authors to upload alignments, phylogenetic trees, and other files requisite for study reproducibility [5] to TreeBASE (http://treebase.org/) and/or other public repositories such as Dryad (http://datadryad.org). Unfortunately, enforcement of such data deposition policies is generally lax, and most journals in systematics and evolution still do not require DNA sequence alignment or tree deposition. As a result, the alignments and trees underlying most published papers in systematics/phylogenetics and evolutionary biology remain inaccessible to the scientific community at large [8],[10].

I sympathize with the goal but I doubt that it can be achieved. I strongly suspect that many scientists don't even bother to produce sequence alignments. They just feed the electronic data directly into their tree-making algorithm.

I wonder how many anomalies could be resolved if they just looked at the alignments? Would they even know if bad sequence data was being used for one or two species in their alignment?
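A rough first pass at spotting bad sequence data does not require eyeballing every column. The sketch below is only an illustration, not a method from Drew et al.: it assumes the alignment is already loaded as a dictionary of equal-length strings, and the function names and the 0.15 cutoff are arbitrary choices made for this example. It flags any sequence whose mean pairwise identity to the others falls well below the typical value for the data set.

```python
from statistics import median

# Characters treated as gaps or missing data (an assumption for this sketch).
GAP_CHARS = set("-?.")


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def flag_suspect_sequences(alignment, drop=0.15):
    """Return names whose mean identity to the rest is more than `drop`
    below the median of all such means. `drop` is an arbitrary cutoff."""
    names = list(alignment)
    if len(names) < 2:
        return []
    mean_identity = {}
    for name in names:
        scores = [
            pairwise_identity(alignment[name], alignment[other])
            for other in names
            if other != name
        ]
        mean_identity[name] = sum(scores) / len(scores)
    typical = median(mean_identity.values())
    return sorted(n for n, v in mean_identity.items() if typical - v > drop)


if __name__ == "__main__":
    toy = {
        "sp1": "ACGTACGTAC",
        "sp2": "ACGTACGTAC",
        "sp3": "ACGTACGTAT",
        "sp4": "TTTTGGGGCC",  # deliberately unlike the others
    }
    print(flag_suspect_sequences(toy))
```

A flagged sequence is not necessarily wrong; it may simply be a divergent taxon. The point is only to tell you which rows of the alignment deserve a look.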

Drew, B.T., Gazis, R., Cabezas, P., Swithers, K.S., Deng, J., Rodriguez, R., Katz, L.A., Crandall, K.A., Hibbett, D.S., and Soltis, D.E. (2013) Lost Branches on the Tree of Life. PLoS Biol 11(9): e1001636. [doi: 10.1371/journal.pbio.1001636]

20 comments :

Konrad said...: Agreed that alignment quality is key, but I wonder why you think the goal is unrealistic?; Thursday, September 05, 2013 3:02:00 PM
John Harshman said...: I don't know of any tree-making programs that don't require an alignment as input. Do you? (POY may be an exception, but I seem to recall that it demands at least a few anchor points.) So yes, everyone is actually producing alignments, even if they're using automated data pipelines, whether or not they pay any attention to how it's being done.

And I agree with the author that public availability of alignments is crucial to reproducibility of science. There's no excuse for journals failing to enforce that policy. If you're producing sequence alignments and aren't depositing them in TreeBASE, shame on you.

Alignment programs, by the way, are now much better than they used to be. And another tool available today that wasn't a while ago is the iterative process that builds a tree from an initial alignment, then uses that as a guide tree for a new alignment, and so on until topology stabilizes, which is an attempt at approaching the goal of the globally most likely tree-alignment combination.; Thursday, September 05, 2013 3:31:00 PM
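The iterative procedure described in the comment above can be sketched as a simple loop. This is a schematic only, under stated assumptions: `align`, `build_tree`, and `same_topology` are hypothetical callables that the caller would supply (for example, wrappers around whatever alignment and tree-building programs are in use), and `max_rounds` is an arbitrary safety limit. It is not the implementation of any particular program.

```python
def iterate_alignment_and_tree(sequences, align, build_tree, same_topology,
                               initial_tree=None, max_rounds=10):
    """Alternate between aligning and tree building until the topology stops changing.

    align(sequences, guide_tree) -> alignment   (guide_tree may be None on the first round)
    build_tree(alignment)        -> tree
    same_topology(tree_a, tree_b) -> bool
    All three are hypothetical callables supplied by the caller.

    Returns (alignment, tree, rounds_used, converged).
    """
    tree = initial_tree
    alignment = None
    for round_no in range(1, max_rounds + 1):
        # Re-align using the current tree (if any) as the guide tree.
        alignment = align(sequences, tree)
        # Estimate a new tree from the new alignment.
        new_tree = build_tree(alignment)
        # Stop once two successive trees share the same topology.
        if tree is not None and same_topology(tree, new_tree):
            return alignment, new_tree, round_no, True
        tree = new_tree
    return alignment, tree, max_rounds, False


if __name__ == "__main__":
    # Stub components, purely to show the control flow.
    canned_trees = iter(["((A,B),C)", "((A,C),B)", "((A,C),B)"])
    result = iterate_alignment_and_tree(
        ["seqA", "seqB", "seqC"],
        align=lambda seqs, guide: (tuple(seqs), guide),
        build_tree=lambda aln: next(canned_trees),
        same_topology=lambda a, b: a == b,
    )
    print(result)
```

Note that stabilization of the topology is only a stopping rule; it does not by itself show that the final alignment and tree are the globally best pair.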
Konrad said...: There are quite a few tools for alignment-free phylogeny construction - the idea is to average over the alignment uncertainty instead of basing the phylogeny on a single point estimate of the alignment. In principle this should make phylogeny construction more robust (if the tools are implemented sensibly; and off the cuff I'd say there may be unanswered questions on how to handle indels in these approaches). However, I'm not familiar with any of the existing tools that do this so I can't give an opinion on how well they work in practice.; Thursday, September 05, 2013 5:40:00 PM
John Harshman said...: Could you name some of those tools?; Thursday, September 05, 2013 6:18:00 PM
un said...: John,

Here's a review of alignment-free methods: Click here.

I believe that they are becoming more and more prevalent.; Thursday, September 05, 2013 11:37:00 PM
John Harshman said...: Those appear to be alignment-free methods of sequence comparison, not phylogenetic analysis, at least based on the abstract. Like BLAST.; Friday, September 06, 2013 9:25:00 AM
nmanning said...: I always eyeball my alignments, regardless of what algorithm I might use to get an initial alignment. The problem is, as Larry alludes to, the size of the datasets. I am currently working on a gigantic alignment (more than 200 taxa, ~15 kb of sequence for each) and not only is it difficult to get a program to even give me an initial alignment, but the prospect of then having to inspect it by eye is giving me the willies...; Friday, September 06, 2013 10:07:00 AM
NickM said...: "I strongly suspect that many scientists don't even bother to produce sequence alignments. They just feed the electronic data directly into their tree-making algorithm."

No, almost all phylogenetics programs require that you input an alignment, not the raw sequences. What you do is write a pipeline that feeds the raw data to an alignment program like ClustalW, THEN blindly feed that to a phylogenetic program ;-).

Seriously, though, what is appropriate depends on the situation. If the sequences are 95% identical, an automated program would be fine and there is no point wasting time with a manual alignment. If similarity is weak, then you need to be much more careful.

If the dataset is huge, you have no choice but to automate, but also you have some hope that any noise in alignment will average out.

http://www.ncbi.nlm.nih.gov/pubmed/16012107; Friday, September 06, 2013 11:26:00 AM
NickM said...: Oops, meant to add, the link at the bottom is an exception, joint Bayesian inference of the alignment and the phylogeny.; Friday, September 06, 2013 11:27:00 AM
John Harshman said...: Hey, we had an alignment of 169 species x 25kb, all checked laboriously by eye. Hard work, but there was no adequate substitute, and we had a dozen or so people doing it. It does appear (from unpublished tests) that an automated, iterative alignment procedure (SATE) can produce similar results, or at least similar trees, without eyeball alignment. But that makes me nervous.

The real problems arise when you're trying to do the same thing with megabases of sequence. No number of eyeballs is going to handle that.; Friday, September 06, 2013 12:08:00 PM
caynazzo said...: What Josh Harshman said. Leaving NGS aside for now...what tree-building programs don't require alignments? For multilocus data sets, you edit and align raw ABI files in Sequencher, CodonCode Aligner or Mesquite before using anything like RAxML.

I think publishing your primer sequences in the manuscript and making your tissue/DNA available upon request is much more important than having those alignment sequences available.

NGS data is a whole new beast. We've been using CLC Genomics Workbench for editing, alignments and de novo assembly. Not perfect by any means.; Friday, September 06, 2013 1:08:00 PM
Joe Felsenstein said...: To repeat what Dr. Matzke [congratulations, Nick!] said: Aside from "alignment-free" methods, which use much less of the information than methods that require an alignment, there are Bayesian methods that jointly sample from the posterior distribution of alignments and trees, under some assumptions about the priors. It's early days yet in that literature, but it does eliminate the possible biases in making the alignment. It doesn't get you one phylogeny, or one alignment, but a cloud of pairs, each consisting of an alignment and a tree.

A search with terms: Bayesian alignment phylogeny will show some of this literature. One such program is BAli-Phy; Friday, September 06, 2013 6:59:00 PM
Jonathan Badger said...: Well, there are simultaneous alignment/phylogeny systems such as Tandy Warnow's Sate (http://www.cs.utexas.edu/~phylo/software/sate/) -- in such systems no pre-existing alignment is needed -- the alignment and phylogeny parts are integrated and iteratively improve each other. It's a clever idea, although I haven't actually used it to create any trees I've used in publications.; Friday, September 06, 2013 8:02:00 PM
John Harshman said...: SATE isn't really a simultaneous alignment/phylogeny system. It's sequential and iterative, and requires an initial guide tree to build the initial alignment. I suppose you might also use an initial alignment to build an initial guide tree, but at any rate the steps are sequential. The iterative approach is interesting and useful, but it isn't simultaneous.; Saturday, September 07, 2013 11:29:00 AM
John Harshman said...: One question that arises is one of practicality. How much time would be required to come to convergence in a reasonably-sized (taxa x sequence) data set, say 100 x 25,000, with frequent indels? A day? A month? Your lifetime?; Saturday, September 07, 2013 11:31:00 AM
Joe Felsenstein said...: Don't know, but you could run BAli-Phy and try to find out.

The method is useful as a conceptual exercise, whether or not it is practical, to clarify how the inference of the alignment and the inference of the tree affect each other.; Saturday, September 07, 2013 7:39:00 PM
Joe Felsenstein said...: I should add that the great pioneer of understanding the interaction of multiple sequence alignment and tree inference is David Sankoff. His 50 years of work, including his present status as the central figure in computational comparative genomics, were celebrated at a special symposium, MAGE, which happened two weeks ago. David's first paper on trees and alignments was the Sankoff-Morel-Cedergren paper in Nature New Biology 40 years ago.; Saturday, September 07, 2013 7:46:00 PM
John Harshman said...: Wait a minute: Josh?; Saturday, September 07, 2013 10:12:00 PM
caynazzo said...: Apologies, John!; Monday, September 09, 2013 5:49:00 PM
John Harshman said...: Oddly enough, my Field Museum ID said Josh, and I never bothered to correct it. I was wondering if you were someone I know in disguise.; Monday, September 09, 2013 9:25:00 PM
