What is multiple sequence alignment?
December 15, 2024
Multiple sequence alignment
Definition and differences versus alignments of 2 sequences
Aligning more than two sequences is referred to as multiple sequence alignment (MSA). It is considerably harder than aligning just two sequences: applying the exact global and local alignment algorithms directly requires time and memory that grow exponentially with the number of sequences, so this is only feasible for a small number of sequences.
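To give a sense of why the direct approach does not scale, the sketch below counts the cells of the dynamic-programming lattice that an exact method would have to fill. It assumes the textbook extension of pairwise dynamic programming to N sequences; the function names are illustrative and do not come from any particular program.

```python
def dp_lattice_cells(lengths):
    """Cells in the full dynamic-programming lattice for the given sequence lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra index per dimension for the empty prefix
    return cells


def predecessors_per_cell(num_sequences):
    """Moves into a cell: each sequence either advances or takes a gap,
    excluding the move in which every sequence takes a gap."""
    return 2 ** num_sequences - 1


# Growth for sequences of length 100 as more sequences are added.
for n in (2, 3, 5, 10):
    print(n, dp_lattice_cells([100] * n), predecessors_per_cell(n))
```

Each added sequence multiplies the number of cells by roughly its length, which is why heuristic strategies are used in practice.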
A key challenge in MSA is determining the optimal placement of gaps in the alignment. To ensure all sequences are of equal length, gaps must be introduced strategically, which complicates the alignment process.
Progressive construction algorithms
The most widely used programs are based on this type of strategy. They do not guarantee that the alignment obtained is the best possible, but they usually find a good solution very efficiently. Examples of programs based on these heuristic methods are ClustalW and T-Coffee.
The method first performs pairwise alignments of all the sequences. From these alignments, a matrix of distances between the sequences is built, and from that matrix a guide tree. The tree identifies the most similar pairs of sequences.
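The distance matrix and guide tree can be sketched in a few lines. This is a simplified illustration, not what ClustalW or T-Coffee actually do: it assumes a unit-cost edit distance (normalized by the longer sequence) as a stand-in for the pairwise alignment score, and average-linkage (UPGMA-style) clustering for the tree. The function names and the toy sequences are made up for the example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost global alignment (Levenshtein) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Normalized pairwise distances, keyed by the pair of sequence names."""
    dist = {}
    for (na, sa), (nb, sb) in combinations(seqs.items(), 2):
        dist[frozenset((na, nb))] = edit_distance(sa, sb) / max(len(sa), len(sb))
    return dist


def guide_tree(seqs):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    dist = distance_matrix(seqs)
    clusters = {name: [name] for name in seqs}  # tree node -> its leaf names

    def average_distance(pair):
        x, y = pair
        total = sum(dist[frozenset((p, q))]
                    for p in clusters[x] for q in clusters[y])
        return total / (len(clusters[x]) * len(clusters[y]))

    while len(clusters) > 1:
        # Join the two closest clusters into a new internal node.
        x, y = min(combinations(clusters, 2), key=average_distance)
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


# Hypothetical toy sequences: A/B and C/D are the two most similar pairs.
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTTCGA", "D": "TTGTTCGG"}
print(guide_tree(seqs))
```

The nested tuples returned by `guide_tree` group the closest sequences at the deepest level, which is the information a progressive aligner needs to decide what to align first.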
The most similar pair of sequences is aligned first. The remaining sequences, or alignments of groups of sequences, are then added to it in the order determined by the guide tree.
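The order of additions implied by a guide tree can be read off with a post-order traversal. The sketch below assumes the tree is given as nested tuples of sequence names (a hypothetical representation chosen for the example) and only lists the joins; it does not perform the alignments themselves.

```python
def merge_order(tree):
    """List the joins (left leaves, right leaves) in the order a progressive
    aligner would perform them when following the guide tree."""
    steps = []

    def visit(node):
        if isinstance(node, str):  # a leaf: a single sequence name
            return [node]
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        steps.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return steps


# Hypothetical guide tree: A and B are closest, then C joins, then D.
tree = ((("A", "B"), "C"), "D")
for left, right in merge_order(tree):
    print(left, "+", right)
```

Each step joins either two sequences, a sequence and an existing alignment, or two existing alignments, depending on the shape of the tree.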
Clustal-derived programs (ClustalW and ClustalX, its graphical version) incorporate an alignment scoring system. The score is based on the genetic distance between each sequence and the tree root, taking into account the score for amino acid or nucleotide changes. In addition to this score, both opening a gap and extending it are penalized. The global alignment score is the sum of the scores of the pairwise alignments. ClustalW tends to place gaps between highly conserved areas rather than inside them, so that these regions are not split.
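The idea of summing pairwise scores with gap penalties can be illustrated with a simplified sum-of-pairs score. This is only a sketch under stated assumptions: the match, mismatch, gap-opening and gap-extension values are arbitrary, a gap of length k costs gap_open + (k - 1) * gap_extend, and the distance-based sequence weighting and substitution matrices used by ClustalW are left out.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two rows of an alignment with affine gap penalties."""
    score = 0
    gap_in = None  # row holding the current gap run: "a", "b", or None
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column is empty for this pair, so it is ignored
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Opening a gap costs more than extending an existing one.
            score += gap_extend if gap_in == side else gap_open
            gap_in = side
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


def sum_of_pairs(alignment):
    """Total score of an alignment as the sum over all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))


# Hypothetical three-sequence alignment.
alignment = ["ACGT-A", "ACG--A", "AC-TTA"]
print(sum_of_pairs(alignment))
```

Because opening a gap is penalized more heavily than extending one, alignments with a few long gaps score better than alignments with many short gaps scattered through the sequences.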
Fig. Alignment of three highly conserved sequences
Fig. Alignment with gaps
Problems of progressive construction alignments
The biggest problems with this type of alignment stem from the alignment of the first pair. If the first two sequences are close (very similar), the base alignment will probably contain few errors. If, on the other hand, these two sequences are very divergent, the alignment obtained will be poor, and its gaps and errors will propagate to the sequences added later, because this first alignment is no longer modified when the remaining alignments are joined to it. To alleviate this problem, it is better to align sequences of similar size, that is, to include in the alignment only those regions present in all the sequences. Otherwise the program will try to align the entire length of the longest sequence by introducing gaps in the rest.
This type of alignment works well for sequences that retain a certain degree of conservation and diverge gradually from one another. Despite the drawbacks described above, these algorithms often find a good solution with few resources, which allows many sequences to be analyzed.
Multiple alignment utilities
Multiple alignments are a fundamental tool. They are essential for further studies, such as:
conserved domain analysis,
secondary structures,
recognition of conserved regions in promoters, etc.
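As a small illustration of how an alignment feeds this kind of analysis, the sketch below scores each column by the fraction of sequences that share its most common residue. The scoring rule and the toy alignment are assumptions made for the example; dedicated tools use more elaborate conservation measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column.
    Gaps never count as a shared residue."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Hypothetical three-sequence alignment.
alignment = ["ACGT-A", "ACG--A", "AC-TTA"]
scores = column_conservation(alignment)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)
```

Runs of consecutive high-scoring columns are candidates for conserved domains or regulatory motifs.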
Multiple alignment editing
The multiple alignments generated by these programs are text files that, in principle, can be edited with any text editor. However, some formats are awkward to handle this way, and a large number of sequences or very long sequences makes manual editing difficult.
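Because the files are plain text, they can also be read with a few lines of code. The sketch below assumes the alignment was saved in FASTA format, which is only one of several formats these programs can write; the function name and the sample text are made up for the example.

```python
def parse_aligned_fasta(text):
    """Read FASTA-formatted text into a dict of name -> aligned sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical alignment file contents.
sample = """>seq1 first sequence
ACGT-A
>seq2
ACG-
-A
>seq3
AC-TTA
"""

alignment = parse_aligned_fasta(sample)
# In a valid alignment every row has the same length, gaps included.
lengths = {len(s) for s in alignment.values()}
print(alignment, lengths)
```

Checking that all rows have the same length is a quick way to detect a file that was damaged by manual editing.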
There are specialized programs for editing alignments. One of the most complete is BioEdit, which can also link to other applications such as BLAST, Primer3, etc. It is a graphical editor that runs on Windows and is quite intuitive to use.