Pharmacogenomics 7 - July 14, 2012 | Lipari School on BioInformatics and Computational Biology

Alberto Apostolico, Georgia Tech, Atlanta, USA - University of Padova, Italy

Compositional Sequence Analysis and Classification
The problem of comparing, classifying and indexing long textual files from large collections is becoming increasingly severe as web applications, digital libraries and genomic studies expand to an unprecedented scale. Established techniques rarely work in these contexts. In computational molecular biology, for instance, edit distances become both computationally prohibitive and of little significance when applied to entire genomes, and they are being supplanted by global similarity measures that refer, implicitly or explicitly, to the composition of sequences in terms of their constituent patterns. Among these, measures of sequence similarity and distance based more or less explicitly on subword composition are attracting increasing interest, driven by intensive applications such as massive document classification and genome-wide molecular taxonomy. A unifying trait of such measures is an underlying notion of relative compressibility, whereby two similar sequences are expected to share a larger number of common substrings than two distant ones. This tutorial reviews some of the approaches to sequence comparison based on subword composition and suggests that their common denominator may ultimately reside in special classes of subwords, the nature of which resonates in interesting ways with the structure of popular subword trees and graphs.
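
As a rough illustration of the idea that similar sequences share more substrings than distant ones, the sketch below compares two sequences by the sets of fixed-length subwords (k-mers) they contain. This is only one simple instance of a subword-composition measure, not the method developed in the tutorial; the function names, the choice of the Jaccard index, and the toy sequences are assumptions made for the example.

```python
def kmer_set(seq: str, k: int) -> set:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a: str, b: str, k: int) -> float:
    """Jaccard index of the two k-mer sets: shared k-mers over all k-mers seen."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    # Two sequences too short to contain any k-mer are treated as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


# Toy sequences (hypothetical): y differs from x by one substitution,
# while z has an unrelated composition.
x = "ACGTACGTAC"
y = "ACGTACGAAC"
z = "TTTTGGGGCC"

close = shared_kmer_fraction(x, y, 3)
far = shared_kmer_fraction(x, z, 3)
print(close > far)
```

No alignment is computed here: the score depends only on which subwords occur, which is what makes measures of this kind applicable to whole genomes where edit distance is not.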

References

A. Apostolico - Maximal Words in Sequence Comparisons Based on Subword Composition. Algorithms and Applications 2010, LNCS 6060, pp. 34-44. Springer, Heidelberg (2010).
Papers posted on the wiki page of the 2011 Workshop on Alignment-Free Sequence Comparison: https://wiki.cc.gatech.edu/alignmentfree/index.php/Main_Page (access will be granted during the School)