Thursday, September 05, 2013

The Importance of Sequence Alignments

There are several required steps in constructing phylogenetic trees from sequence data. The first step is to align the sequences so you can make direct comparisons. It used to be the case that multiple sequence alignments had to be checked manually because none of the available computer programs were as good as an experienced scientist. That hasn't changed. What's changed is that the data sets have become so large and complicated that nobody wants to even look at the sequence alignments to see if they can be improved.

Drew et al. (2013) suggest that sequence alignments should be made available:
Until recently, uploading sequences to GenBank (or EMBL) was generally considered sufficient to ensure reproducibility of phylogenetic studies using DNA sequence data. Increasingly, however, the systematics community is realizing that archiving raw DNA sequences is not adequate, and that the underlying alignments of DNA sequences as well as the resulting phylogenetic trees are pivotal for reproducibility, comparative purposes, meta-analyses, and ultimately synthesis. Indeed, there has been a growing clamor for journals to adopt and enforce more rigorous data archiving practices across diverse disciplines [4]–[8]. As a result, about 35 evolutionary journals [5],[9] have adopted policies to encourage or require authors to upload alignments, phylogenetic trees, and other files requisite for study reproducibility [5] to TreeBASE (http://treebase.org/) and/or other public repositories such as Dryad (http://datadryad.org). Unfortunately, enforcement of such data deposition policies is generally lax, and most journals in systematics and evolution still do not require DNA sequence alignment or tree deposition. As a result, the alignments and trees underlying most published papers in systematics/phylogenetics and evolutionary biology remain inaccessible to the scientific community at large [8],[10].
I sympathize with the goal but I doubt that it can be achieved. I strongly suspect that many scientists don't even bother to produce sequence alignments. They just feed the electronic data directly into their tree-making algorithm.

I wonder how many anomalies could be resolved if they just looked at the alignments? Would they even know if bad sequence data was being used for one or two species in their alignment?
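One rough way to catch that kind of problem can be sketched in a few lines of Python. This is only an illustration, not a method from Drew et al. or from any particular package: the function names, the `margin` parameter, and its default of 0.2 are all assumptions made for the example. It takes an alignment as a dictionary of equal-length gapped strings, computes each sequence's mean pairwise identity to all the others, and flags any sequence that falls well below the median.

```python
from statistics import mean, median


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def flag_suspect_sequences(alignment, margin=0.2):
    """Return names whose mean identity to the other sequences is more than
    `margin` below the median of those means. `margin` is an arbitrary,
    hypothetical threshold chosen for illustration."""
    names = list(alignment)
    if len(names) < 3:
        return []  # too few sequences to say which one is the odd one out
    mean_identity = {
        name: mean(
            pairwise_identity(alignment[name], alignment[other])
            for other in names
            if other != name
        )
        for name in names
    }
    typical = median(mean_identity.values())
    return [n for n in names if typical - mean_identity[n] > margin]
```

A check like this says nothing about why a sequence is odd (contamination, a paralog, a misaligned region, or a genuinely divergent lineage), so anything it flags still needs someone to look at the alignment.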


Drew, B.T., Gazis, R., Cabezas, P., Swithers, K.S., Deng, J., Rodriguez, R., Katz, L.A., Crandall, K.A., Hibbett, D.S., and Soltis, D.E. (2013) Lost Branches on the Tree of Life. PLoS Biol 11(9): e1001636. [doi: 10.1371/journal.pbio.1001636]

20 comments :

  1. Agreed that alignment quality is key, but I wonder why you think the goal is unrealistic?

  2. I don't know of any tree-making programs that don't require an alignment as input. Do you? (POY may be an exception, but I seem to recall that it demands at least a few anchor points.) So yes, everyone is actually producing alignments, even if they're using automated data pipelines, whether or not they pay any attention to how it's being done.

    And I agree with the author that public availability of alignments is crucial to reproducibility of science. There's no excuse for journals failing to enforce that policy. If you're producing sequence alignments and aren't depositing them in TreeBASE, shame on you.

    Alignment programs, by the way, are now much better than they used to be. And another tool available today that wasn't a while ago is the iterative process that builds a tree from an initial alignment, then uses that as a guide tree for a new alignment, and so on until topology stabilizes, which is an attempt at approaching the goal of the globally most likely tree-alignment combination.
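    The loop described above can be sketched as follows. This is a minimal outline of the control flow only, not the implementation used by any real program: `align`, `build_tree`, and `topology` are hypothetical callables standing in for whatever aligner, tree-building program, and tree-comparison step one actually uses, and `max_rounds` is an assumed safety limit.

```python
def iterate_alignment_and_tree(sequences, align, build_tree, topology, max_rounds=10):
    """Alternate alignment and tree estimation until the topology stops changing.

    Hypothetical placeholder callables:
      align(sequences, guide_tree) -> alignment  (guide_tree is None in the first round)
      build_tree(alignment)        -> tree
      topology(tree)               -> hashable summary used to compare rounds

    Returns (alignment, tree, converged).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    guide_tree = None
    previous = None
    for _ in range(max_rounds):
        alignment = align(sequences, guide_tree)  # realign using the latest tree
        tree = build_tree(alignment)              # re-estimate the tree
        current = topology(tree)
        if current == previous:
            return alignment, tree, True          # same topology twice in a row
        previous = current
        guide_tree = tree
    return alignment, tree, False                 # gave up without stabilizing
```

    Note that a stable topology only means the loop has stopped moving; it does not show that the final tree-alignment pair is the globally best one.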

  3. There are quite a few tools for alignment-free phylogeny construction - the idea is to average over the alignment uncertainty instead of basing the phylogeny on a single point estimate of the alignment. In principle this should make phylogeny construction more robust (if the tools are implemented sensibly; and off the cuff I'd say there may be unanswered questions on how to handle indels in these approaches). However, I'm not familiar with any of the existing tools that do this so I can't give an opinion on how well they work in practice.

    Replies
    1. John,

      Here's a review of alignment-free methods: Click here.

      I believe that they are becoming more and more prevalent.

    2. Those appear to be alignment-free methods of sequence comparison, not phylogenetic analysis, at least based on the abstract. Like BLAST.

  4. I always eyeball my alignments, regardless of what algorithm I might use to get an initial alignment. The problem is, as Larry alludes to, the size of the datasets. I am currently working on a gigantic alignment (more than 200 taxa, ~15 kb of sequence for each) and not only is it difficult to get a program to even give me an initial alignment, but the prospect of then having to inspect it by eye is giving me the willies...

    Replies
    1. Hey, we had an alignment of 169 species x 25kb, all checked laboriously by eye. Hard work, but there was no adequate substitute, and we had a dozen or so people doing it. It does appear (from unpublished tests) that an automated, iterative alignment procedure (SATE) can produce similar results, or at least similar trees, without eyeball alignment. But that makes me nervous.

      The real problems arise when you're trying to do the same thing with megabases of sequence. No number of eyeballs is going to handle that.

  5. "I strongly suspect that many scientists don't even bother to produce sequence alignments. They just feed the electronic data directly into their tree-making algorithm."

    No, almost all phylogenetics programs require that you input an alignment, not the raw sequences. What you do is write a pipeline that feeds the raw data to an alignment program like ClustalW, THEN blindly feed that to a phylogenetic program ;-).

    Seriously, though, what is appropriate depends on the situation. If the sequences are 95% identical, an automated program would be fine and there is no point wasting time with a manual alignment. If similarity is weak, then you need to be much more careful.

    If the dataset is huge, you have no choice but to automate, but also you have some hope that any noise in alignment will average out.

    http://www.ncbi.nlm.nih.gov/pubmed/16012107

  6. Oops, meant to add, the link at the bottom is an exception, joint Bayesian inference of the alignment and the phylogeny.

  7. What Josh Harshman said. Leaving NGS aside for now...what tree building programs don't require alignments? For multilocus data sets, you edit and align raw ABI files in Sequencher, CodonCode Aligner or Mesquite before using anything like RAxML.

    I think publishing your primer sequences in the manuscript and making your tissue/DNA available upon request is much more important than having those alignment sequences available.

    NGS data is a whole new beast. We've been using CLC Genomics Workbench for editing, alignments and de novo assembly. Not perfect by any means.

    Replies
    1. Well, there are simultaneous alignment/phylogeny systems such as Tandy Warnow's Sate (http://www.cs.utexas.edu/~phylo/software/sate/) -- in such systems no pre-existing alignment is needed -- the alignment and phylogeny parts are integrated and iteratively improve each other. It's a clever idea, although I haven't actually used it to create any trees I've used in publications.

    2. SATE isn't really a simultaneous alignment/phylogeny system. It's sequential and iterative, and requires an initial guide tree to build the initial alignment. I suppose you might also use an initial alignment to build an initial guide tree, but at any rate the steps are sequential. The iterative approach is interesting and useful, but it isn't simultaneous.

    3. Oddly enough, my Field Museum ID said Josh, and I never bothered to correct it. I was wondering if you were someone I know in disguise.

  8. To repeat what Dr. Matzke [congratulations, Nick!] said: Aside from "alignment-free" methods, which use much less of the information than methods that require an alignment, there are Bayesian methods that jointly sample from the posterior distribution of alignments and trees, under some assumptions about the priors. It's early days yet in that literature, but it does eliminate the possible biases in making the alignment. It doesn't get you one phylogeny, or one alignment, but a cloud of pairs, each consisting of an alignment and a tree.

    A search with the terms "Bayesian alignment phylogeny" will show some of this literature. One such program is BAli-Phy.

    Replies
    1. One question that arises is one of practicality. How much time would be required to come to convergence in a reasonably-sized (taxa x sequence) data set, say 100 x 25,000, with frequent indels? A day? A month? Your lifetime?

    2. Don't know, but you could run BAli-Phy and try to find out.

      The method is useful as a conceptual exercise, whether or not it is practical, to clarify how the inference of the alignment and the inference of the tree affect each other.

    3. I should add that the great pioneer of understanding the interaction of multiple sequence alignment and tree inference is David Sankoff. His 50 years of work, including his present status as the central figure in computational comparative genomics, were celebrated at a special symposium, MAGE, which happened two weeks ago. David's first paper on trees and alignments was the Sankoff-Morel-Cedergren paper in Nature New Biology 40 years ago.
