Alignment-free Classification and Comparison of Biological Sequences and Structures
Similarity of sequences is a key mathematical notion for Classification and
Phylogenetic studies in Biology. It is currently primarily handled using alignments.
However, the alignment methods seem inadequate for post-genomic studies since they
do not scale well with data set size and they seem to be confined only to genomic
and proteomic sequences. Therefore, alignment-free similarity measures are
actively pursued. Those measures offer the additional advantage of being
applicable to structural data, as long as the structure is encoded into a string.
A series of recent ground-breaking results have given evidence that those
techniques are valid alternatives to classic methods, based on alignments, for
classification and comparison of biological sequences and structures. In this
tutorial, the state of the art in this area is presented. In depth attention
will be given to the Universal Similarity Metric, a methodology based on
Kolmogorov Complexity. In particular, the first set of extensive quantitative
experiments showing that USM is successfully applicable in Biology will be
presented.
References:
Ferragina P, Giancarlo R., Greco V., Manzini G., Valiente G. (2007). Compression-based classification of biological sequences and structures via the Universal Similarity Metric: Experimental Assessment, BMC Bioinformatics , 8 :252.
Gilbert D, Rossello F, Valiente G, Veeramalai M. (2007). Alignment-Free Comparison of TOPS Strings, In London Algorithmics and Stringology 2006 , Volume . Edited by Daykin J, Mohamed M, Steinhofel K, College Publications:177-197.
Cilibrasi R, Vitanyi PMB. (2005). Clustering by Compression. IEEE Tr. Inform. Theory , 51 :1523-1545.
Keogh E, Lonardi S, Rtanamahata C. (2004). Towards parameter-free data mining, Proc. 10th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, ACM :206-215.
Krasnogor N, Pelta DA. (2004). Measuring the Similarity of Protein Structures by Means of the Universal Similarity Metric, Bioinformatics , 20 :1015-1021.
Li M, Chen X, Li X, Ma B, Vitanyi PMB. (2004). The Similarity Metric, IEEE Tr. Inform. Theory , 50 :3250--3264.
Vinga S, Almeida J. (2003). Alignment-Free Sequence Comparison: A Review. Bioinformatics , 19 :513-523.