Statistical and Machine Learning Methods in Computational Biology June 12 - June 19, 2010 | Lipari School on BioInformatics and Computational Biology

Alfredo Pulvirenti, University of Catania, Italy

Two and Multiple Sequence Alignment
Sequence alignment is a basic step in comparative sequence analysis. Although it is a classic bioinformatics problem, designing tools that produce high-quality alignments for distantly related sequences is still challenging.
In this tutorial, algorithmic approaches to sequence alignment will be surveyed. The basic computational formulation of pairwise alignment, together with its dynamic programming solution, will be introduced. Next, the much harder case of multiple sequences will be examined in depth. Special focus will be given to key techniques for producing biologically sound results: probabilistic consistency, segment-based alignment, exploiting additional non-sequence information (structure, profiles, etc.), gap handling, adding flexibility for aligning proteins with different domain architectures, and improving scalability.
Finally, multiple sequence alignment (MSA) validation issues such as de facto standard benchmarks, scoring systems, and statistical significance will be discussed.
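As a rough illustration of the dynamic programming formulation of pairwise alignment mentioned above, the following is a minimal sketch of global alignment in the style of Needleman and Wunsch. The function name `needleman_wunsch`, the linear gap penalty, and the score values (match +1, mismatch -1, gap -1) are assumptions chosen for readability; they are not taken from the tutorial, and practical tools typically use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with a linear gap penalty.

    The scoring values are illustrative assumptions, not recommended settings.
    Returns (optimal score, aligned a, aligned b).
    """
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    # Fill the table: each cell extends one of three smaller alignments.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` returns `(2, "ACGT", "A-GT")`: three matches and one gap. The table has (n+1) x (m+1) cells, so time and memory grow with the product of the sequence lengths. A direct generalization to k sequences would need a k-dimensional table, which is one reason the multiple-sequence case relies on heuristics.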

Bibliography:

Notredame,C. et al. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217.

Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

Pei,J. and Grishin,N.V. (2007) PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics, 23, 802–808.

Schwartz,A.S. and Pachter,L. (2007) Multiple alignment by sequence annealing. Bioinformatics, 23, e24–e29.

Do,C.B. et al. (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res., 15, 330–340.

Raphael,B. et al. (2004) A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res., 14, 2336–2346.

Löytynoja,A. and Goldman,N. (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl. Acad. Sci. USA, 102, 10557–10562.

Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

Rausch,T. et al. (2008) Segment-based multiple sequence alignment. Bioinformatics, 24, i187–i192.

Kececioglu,J.D. (1993) The maximum weight trace problem in multiple sequence alignment. In CPM ’93: Proceedings of 4th Annual Symposium on Combinatorial Pattern Matching. Springer-Verlag, London, UK, pp. 106–119.