Sequence alignment is one of the crucial steps in deciding whether two genes/proteins are homologous. The two sequences are aligned from one end to the other and the number of identical, or similar, residues is counted. If this number reaches a significant percentage of the total length (usually >25%) then the two sequences are considered homologous; that is, they descend from a common ancestor.
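
The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity` and the toy sequences are made up for the example, and it assumes the two sequences have already been aligned to the same length, with `-` marking gaps.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of alignment columns where both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        return 0.0
    # A column counts only when the residues are identical and neither is a gap.
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * identical / len(row_a)


# Hypothetical five-residue peptides that differ at one position.
print(percent_identity("MKVLA", "MKILA"))  # 80.0
```

Counting similar (rather than identical) residues would need a substitution table, which this sketch leaves out.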
Sequence alignment is not straightforward, even for two sequences, because in addition to substitutions the genes might have undergone insertions or deletions (indels). In order to identify conserved residues, one needs to insert gaps in one sequence or the other to compensate for these indel events.
You can't just willy-nilly stick in gaps to maximize the number of aligned residues because the gaps represent true historical events (insertions and deletions). In theory, you can get high identity scores with any two sequences as long as you insert enough gaps but that isn't allowed. When the alignment is done by computer algorithm, each gap is associated with a gap penalty.
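
Here is a minimal sketch of why the penalty matters. It is a textbook-style global alignment score computed by dynamic programming, not the method of any particular program. The scoring values are assumptions chosen for illustration: 1 per identity, 0 per mismatch, and a fixed cost per gap position.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = 0, gap: int = -3) -> int:
    """Best global alignment score with a fixed cost per gap position."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either pair the two residues, or place a gap in one of the sequences.
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# With free gaps, two four-letter strings that share no column can still be
# shifted into three identities.
print(global_alignment_score("ATAT", "TATA", gap=0))   # 3
# With a cost of 3 per gap position, the shift no longer pays for itself.
print(global_alignment_score("ATAT", "TATA", gap=-3))  # 0
```

With the penalty switched off, the score is just the largest number of identities that gaps can manufacture, which is exactly the inflation the penalty is there to prevent.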
The determination of proper gap penalties is a major challenge in multiple sequence alignment. A crude estimate is that each gap comes with a penalty of 3; that is, you have to generate at least three identities in order to make the gap worthwhile. The gap penalties have to be subtracted from the identity/similarity scores when deciding about homology. (This isn't always done.)
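
The arithmetic of that rule of thumb looks like this in Python. The sketch assumes that each unbroken run of dashes counts as one gap, which is one possible reading of the rule; the function name and the toy sequences are hypothetical.

```python
import re


def adjusted_identity_score(row_a: str, row_b: str, gap_penalty: int = 3) -> int:
    """Identities minus a fixed penalty for every gap in either aligned row."""
    identities = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    # Assumption: one contiguous run of dashes is one gap (one indel event).
    gaps = len(re.findall(r"-+", row_a)) + len(re.findall(r"-+", row_b))
    return identities - gap_penalty * gaps


# Eight identities and one two-residue gap: 8 - 3 = 5.
print(adjusted_identity_score("ACGT--ACGT", "ACGTTTACGT"))  # 5
```

A gap that recovers fewer than three identities lowers the adjusted score, so by this measure it isn't worth inserting.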
Here's an example of a multiple sequence alignment from a region of bacterial HSP70 genes. The letters represent the amino acid residues and the dashes are gaps due to insertions and deletions.
The HSP70 genes are the most highly conserved genes in biology so, in principle, it should be easy to align them. In fact, it is easy in most regions but the one shown above is the most difficult. This is a manual alignment that takes into account the similarities of groups of sequences. Those that are most similar are clustered together and whenever possible the alignment is adjusted so that the positions of the gaps in the most closely related sequences are identical.
This is a procedure known as phylogenetic alignment but it would be better to call it similarity alignment because what we're actually doing is clustering sequences by their overall similarity and not their phylogeny. (The fact that their phylogenetic relatedness closely corresponds to their similarity is a consequence of the analysis and not a cause.) [1]
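
The first step of clustering by similarity, finding the pair of sequences that are most alike, can be sketched as follows. This is a toy illustration and not the manual procedure itself: the names `identity_count` and `most_similar_pair` are invented, the three sequences are made up, and the rows are assumed to be already aligned to the same length.

```python
from itertools import combinations


def identity_count(row_a: str, row_b: str) -> int:
    """Number of columns where two aligned rows carry the same residue."""
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")


def most_similar_pair(rows: dict[str, str]) -> tuple[str, str]:
    """Names of the two aligned sequences that share the most identities."""
    return max(
        combinations(sorted(rows), 2),
        key=lambda pair: identity_count(rows[pair[0]], rows[pair[1]]),
    )


# Hypothetical aligned fragments: A and B differ at one position only.
rows = {"A": "MKVLAG", "B": "MKVLSG", "C": "MRILSA"}
print(most_similar_pair(rows))  # ('A', 'B')
```

Repeating this step, each time treating the clustered sequences as a group, gives the order in which sequences are brought together. Nothing in it uses a tree, only similarity.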
The placing of gaps in this region of HSP70 sequences is very difficult. No computer program can come close to achieving the quality of alignments that well-trained humans can achieve. That's because the overall alignment has to take into account a number of variables simultaneously and the progressive alignment takes many trial-and-error steps. As a general rule of thumb, if you see a paper where phylogenetic trees are constructed using computer-generated multiple sequence alignments only, then you should assign a low confidence value to that work.
Is this important? Indeed it is. The exact nature and position of the large gap in the above sequences, for example, plays an important role in testing the Three Domain Hypothesis. Different alignments give different trees and the most important variable is the position of gaps.
This brings me to an important paper just published in this week's issue of Science. Löytynoja and Goldman (2008) have developed a new algorithm for multiple sequence alignment. The abstract of their paper describes the problem, and their solution.
Genetic sequence alignment is the basis of many evolutionary and comparative studies, and errors in alignments lead to errors in the interpretation of evolutionary information in genomes. Traditional multiple sequence alignment methods disregard the phylogenetic implications of gap patterns that they create and infer systematically biased alignments with excess deletions and substitutions, too few insertions, and implausible insertion-deletion–event histories. We present a method that prevents these systematic errors by recognizing insertions and deletions as distinct evolutionary events. We show theoretically and practically that this improves the quality of sequence alignments and downstream analyses over a wide range of realistic alignment problems. These results suggest that insertions and sequence turnover are more common than is currently thought and challenge the conventional picture of sequence evolution and mechanisms of functional and structural changes.
The authors test their phylogeny-aware program (PRANK) against several other multiple sequence alignment programs (ClustalW, MAFFT, MUSCLE, and T-COFFEE) using a set of sequences that were "evolved" using a computer program that created substitutions and insertions/deletions. Since the true phylogeny of this artificial set is known, they were able to evaluate the performance of the various programs.
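
To give a feel for how a program's output can be checked against simulated data, here is one simple measure: the fraction of truly aligned residue pairs that a test alignment recovers. This is an assumption for illustration and not necessarily the measure used in the paper; the function names and the toy alignments are hypothetical.

```python
def aligned_pairs(row_a: str, row_b: str) -> set[tuple[int, int]]:
    """Residue index pairs (i in A, j in B) that share an alignment column."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        # Advance each residue counter only past real residues, not gaps.
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def pair_recovery(true_a: str, true_b: str, test_a: str, test_b: str) -> float:
    """Fraction of the true aligned pairs that also appear in the test alignment."""
    truth = aligned_pairs(true_a, true_b)
    if not truth:
        return 0.0
    return len(truth & aligned_pairs(test_a, test_b)) / len(truth)


# The test alignment puts the gap one column too far to the right,
# so one of the four true residue pairs is lost.
print(pair_recovery("AC-GT", "ACTGT", "ACG-T", "ACTGT"))  # 0.75
```

Both alignments in the example have the same number of gaps and nearly the same number of identities, yet they disagree about which residues are homologous. That's the kind of difference a simulated "true" alignment makes visible.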
As you might expect, PRANK came out best in this test. I'm not sure that it would work best with real data but that's not really my point. My point is that this is an ongoing problem that has not been fully solved. It is still best to avoid multiple sequence alignments that have not been manually improved by humans with considerable experience in sequence alignment.
I'll close by quoting from the discussion in Löytynoja and Goldman (2008) just to remind everyone how important this is. They argue that even post-alignment human "refinement" of computer-generated sequence alignments suffers from systematic bias.
Our analyses show that sequence alignment remains a challenging task, and alignments generated with methods based on the traditional progressive algorithm may lead to seriously incorrect conclusions in evolutionary and comparative studies. The main reason for their systematic error is disregard of the phylogenetic implications of gap patterns created—which is not corrected by considering alignment consistency (13) or using post alignment refinement (14, 15)—and this error is intensified by methods that intentionally force gaps into tight blocks. Affected methods can be positively misleading and become increasingly confident of erroneous solutions as more sequences are included. It is not the progressive algorithm as such that is defective, rather, correct alignment requires that we take account of sequences' phylogeny, irrespective of alignment method used or data type, but the original implementations of the progressive algorithm have a flaw that has gone unnoticed as long as different methods have been consistent in the error they create.
That such a significant error has passed undetected may be explained by the alignment field's historical focus on proteins, where these biases tend to be manifested in less-constrained regions such as loops (compare Fig. 1). Alignments with insertions and deletions squeezed compactly between conserved blocks may suffice for, and even be preferred by, some molecular biologists working with proteins. We have shown, however, that these patterns are, in fact, imposed by systematic biases in alignment algorithms, even in cases where they are incorrect and, indeed, phylogenetically unreasonable. We contend that algorithms that impose gap patterns like those found in structural alignments of proteins are inappropriate for the increasingly widespread analysis of genomic DNA and are likely to cause error when the resulting alignments are used for evolutionary inferences.
1. In a sense, phylogenetic alignment creates a circular argument. What we're trying to do is to build a phylogenetic tree from the multiple sequence alignments. If we use the presumed phylogeny to generate the alignments then we have a problem. Part of the problem goes away once we recognize that the alignment is driven by clustering similar sequences rather than phylogenetically related sequences.
Löytynoja, A. and Goldman, N. (2008) Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis. Science 320:1632-1635. [DOI: 10.1126/science.1158395]