Sandwalk: The Importance of Sequence Alignments

Thursday, September 05, 2013

There are several required steps in constructing phylogenetic trees from sequence data. The first step is to align the sequences so you can make direct comparisons. It used to be the case that multiple sequence alignments had to be checked manually because none of the available computer programs were as good as an experienced scientist. That hasn't changed. What's changed is that the data sets have become so large and complicated that nobody wants to even look at the sequence alignments to see if they can be improved.

Drew et al. (2013) suggest that sequence alignments should be made available.

Until recently, uploading sequences to GenBank (or EMBL) was generally considered sufficient to ensure reproducibility of phylogenetic studies using DNA sequence data. Increasingly, however, the systematics community is realizing that archiving raw DNA sequences is not adequate, and that the underlying alignments of DNA sequences as well as the resulting phylogenetic trees are pivotal for reproducibility, comparative purposes, meta-analyses, and ultimately synthesis. Indeed, there has been a growing clamor for journals to adopt and enforce more rigorous data archiving practices across diverse disciplines [4]–[8]. As a result, about 35 evolutionary journals [5],[9] have adopted policies to encourage or require authors to upload alignments, phylogenetic trees, and other files requisite for study reproducibility [5] to TreeBASE (http://treebase.org/) and/or other public repositories such as Dryad (http://datadryad.org). Unfortunately, enforcement of such data deposition policies is generally lax, and most journals in systematics and evolution still do not require DNA sequence alignment or tree deposition. As a result, the alignments and trees underlying most published papers in systematics/phylogenetics and evolutionary biology remain inaccessible to the scientific community at large [8],[10].

I sympathize with the goal but I doubt that it can be achieved. I strongly suspect that many scientists don't even bother to produce sequence alignments. They just feed the electronic data directly into their tree-making algorithm.

I wonder how many anomalies could be resolved if they just looked at the alignments? Would they even know if bad sequence data was being used for one or two species in their alignment?
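One rough way to approach that last question, short of reading every column by eye, is to compute each sequence's mean pairwise identity to the rest of the alignment and flag anything that sits far below the others. The sketch below is only an illustration: the function names, the toy sequences, and the 50% cutoff are all assumptions made for the example, not anything from Drew et al. or from a particular program.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def flag_outliers(alignment, min_mean_identity=50.0):
    """Return names whose mean identity to all other rows is below the cutoff.

    `alignment` maps a sequence name to its aligned (equal-length) string.
    The 50% default is an arbitrary illustrative threshold.
    """
    flagged = []
    for name, seq in alignment.items():
        others = [
            percent_identity(seq, other_seq)
            for other_name, other_seq in alignment.items()
            if other_name != name
        ]
        if others and sum(others) / len(others) < min_mean_identity:
            flagged.append(name)
    return flagged


# Toy alignment (hypothetical data): species_D is deliberately unrelated.
alignment = {
    "species_A": "ATGCCGTAGCTAGCTA",
    "species_B": "ATGCCGTAGCTAGCTT",
    "species_C": "ATGCCGTA-CTAGCTA",
    "species_D": "TTACGGATCGATCGAT",
}

print(flag_outliers(alignment))  # ['species_D']
```

A check like this only catches gross problems such as a mislabeled or contaminated sequence; it is no substitute for actually looking at the alignment.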

Drew, B.T., Gazis, R., Cabezas, P., Swithers, K.S., Deng, J., Rodriguez, R., Katz, L.A., Crandall, K.A., Hibbett, D.S., and Soltis, D.E. (2013) Lost Branches on the Tree of Life. PLoS Biol 11(9): e1001636. [doi: 10.1371/journal.pbio.1001636]

20 comments:

Konrad, Thursday, September 05, 2013 3:02:00 PM
Agreed that alignment quality is key, but I wonder why you think the goal is unrealistic?
John Harshman, Thursday, September 05, 2013 3:31:00 PM
I don't know of any tree-making programs that don't require an alignment as input. Do you? (POY may be an exception, but I seem to recall that it demands at least a few anchor points.) So yes, everyone is actually producing alignments, even if they're using automated data pipelines, whether or not they pay any attention to how it's being done.

And I agree with the author that public availability of alignments is crucial to reproducibility of science. There's no excuse for journals failing to enforce that policy. If you're producing sequence alignments and aren't depositing them in TreeBASE, shame on you.

Alignment programs, by the way, are now much better than they used to be. And another tool available today that wasn't a while ago is the iterative process that builds a tree from an initial alignment, then uses that as a guide tree for a new alignment, and so on until topology stabilizes, which is an attempt at approaching the goal of the globally most likely tree-alignment combination.
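The iterative process described above can be sketched as a simple loop. This is a minimal outline under stated assumptions, not any specific program's implementation: `align`, `build_tree`, and `same_topology` are hypothetical callables you would supply (for example, wrappers around whatever aligner and tree-builder you use), and `max_rounds` is an arbitrary safety cap.

```python
def iterate_alignment_and_tree(sequences, align, build_tree, same_topology,
                               max_rounds=10):
    """Alternate alignment and tree-building until the topology stops changing.

    Hypothetical callables supplied by the caller:
      align(sequences, guide_tree) -> alignment (guide_tree is None at first)
      build_tree(alignment)        -> tree
      same_topology(tree_a, tree_b) -> bool
    Returns (alignment, tree, rounds_used).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    guide_tree = None
    previous_tree = None
    for round_number in range(1, max_rounds + 1):
        # Re-align using the latest tree as the guide tree.
        alignment = align(sequences, guide_tree)
        tree = build_tree(alignment)
        # Stop once two consecutive rounds give the same topology.
        if previous_tree is not None and same_topology(tree, previous_tree):
            return alignment, tree, round_number
        previous_tree = tree
        guide_tree = tree
    # Topology never stabilized within the cap; return the last result.
    return alignment, tree, max_rounds
```

Note that a stable topology only means the loop has stopped changing its answer. It does not by itself show that the globally best tree-alignment pair has been found.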
Konrad, Thursday, September 05, 2013 5:40:00 PM
There are quite a few tools for alignment-free phylogeny construction - the idea is to average over the alignment uncertainty instead of basing the phylogeny on a single point estimate of the alignment. In principle this should make phylogeny construction more robust (if the tools are implemented sensibly; and off the cuff I'd say there may be unanswered questions on how to handle indels in these approaches). However, I'm not familiar with any of the existing tools that do this so I can't give an opinion on how well they work in practice.
nmanning, Friday, September 06, 2013 10:07:00 AM
I always eyeball my alignments, regardless of what algorithm I might use to get an initial alignment. The problem is, as Larry alludes to, the size of the datasets. I am currently working on a gigantic alignment (more than 200 taxa, ~15 kb of sequence for each) and not only is it difficult to get a program to even give me an initial alignment, but the prospect of then having to inspect it by eye is giving me the willies...
NickM, Friday, September 06, 2013 11:26:00 AM
"I strongly suspect that many scientists don't even bother to produce sequence alignments. They just feed the electronic data directly into their tree-making algorithm."

No, almost all phylogenetics programs require that you input an alignment, not the raw sequences. What you do is write a pipeline that feeds the raw data to an alignment program like ClustalW, THEN blindly feed that to a phylogenetic program ;-).

Seriously, though, what is appropriate depends on the situation. If the sequences are 95% identical, an automated program would be fine and there is no point wasting time with a manual alignment. If similarity is weak, then you need to be much more careful.
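As a rough illustration of that rule of thumb, here is a sketch that screens unaligned sequences before deciding how much care the alignment needs. It uses `difflib.SequenceMatcher` from the Python standard library as a crude stand-in for percent identity, which is an assumption made for the example: the ratio it returns is not a true alignment-based identity. The function names are hypothetical, and the 0.95 default simply echoes the 95% figure above.

```python
from difflib import SequenceMatcher
from itertools import combinations


def min_pairwise_similarity(sequences):
    """Lowest SequenceMatcher ratio (0.0 to 1.0) over all pairs of sequences.

    This is a crude similarity proxy, not a real percent identity.
    Requires at least two sequences.
    """
    return min(
        SequenceMatcher(None, a, b, autojunk=False).ratio()
        for a, b in combinations(sequences, 2)
    )


def needs_manual_review(sequences, threshold=0.95):
    """True if any pair falls below the threshold, suggesting a careful look."""
    return min_pairwise_similarity(sequences) < threshold


# Hypothetical toy sequences.
similar = ["ATGCCGTAGCTAGCTA", "ATGCCGTAGCTAGCTA", "ATGCCGTAGCTAGCTA"]
divergent = ["AAAAAAAAAAAAAAAA", "TTTTTTTTTTTTTTTT"]

print(needs_manual_review(similar))    # False
print(needs_manual_review(divergent))  # True
```

The comparison is quadratic in the number of sequences and slow on long ones, so for a large dataset you would want to run it on a sample.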

If the dataset is huge, you have no choice but to automate, but also you have some hope that any noise in alignment will average out.

http://www.ncbi.nlm.nih.gov/pubmed/16012107
NickM, Friday, September 06, 2013 11:27:00 AM
Oops, meant to add: the link at the bottom of my previous comment is an exception, joint Bayesian inference of the alignment and the phylogeny.
caynazzo, Friday, September 06, 2013 1:08:00 PM
What John Harshman said. Leaving NGS aside for now... what tree-building programs don't require alignments? For multilocus data sets, you edit and align raw ABI files in Sequencher, CodonCode Aligner, or Mesquite before using anything like RAxML.

I think publishing your primer sequences in the manuscript and making your tissue/DNA available upon request is much more important than having those alignment sequences available.

NGS data is a whole new beast. We've been using CLC Genomics Workbench for editing, alignments and de novo assembly. Not perfect by any means.
Joe Felsenstein, Friday, September 06, 2013 6:59:00 PM
To repeat what Dr. Matzke [congratulations, Nick!] said: Aside from "alignment-free" methods, which use much less of the information than methods that require an alignment, there are Bayesian methods that jointly sample from the posterior distribution of alignments and trees, under some assumptions about the priors. It's early days yet in that literature, but it does eliminate the possible biases in making the alignment. It doesn't get you one phylogeny, or one alignment, but a cloud of pairs, each consisting of an alignment and a tree.

A search with the terms "Bayesian alignment phylogeny" will show some of this literature. One such program is BAli-Phy.