SeqAn, a generic C++ library for the analysis of biological sequences
Biological sequence analysis is the heart of computational biology. Many successful
algorithms (e.g., Myers' bit-vector search algorithm, BLAST, etc.) and data
structures (e.g., suffix arrays, q-gram based string indices, sequence profiles)
have been developed over the last fifteen years. The assembly of large eukaryotic
genomes such as Drosophila melanogaster, human, and mouse is a prime example of
algorithm research being successfully applied to a biological problem. However, with
entire genomes in hand, large-scale analysis algorithms that require considerable
computing resources are becoming increasingly important (e.g., Lagan, MUMmer, MGA,
Mauve). Although these tools use slightly different algorithms, nearly all of them
require some basic algorithmic components, like a suffix array, exact or approximate
string searches, a chaining of fragments, or local alignments. The construction of
these components is, however, non-trivial. As a result, suboptimal data types
and ad-hoc algorithms are frequently employed. The lack of readily available,
sophisticated implementations of the accepted algorithms and data types greatly
hinders the rapid development of large-scale applications. In this tutorial we
present SeqAn, a generic C++ library for the analysis of biological sequences
(see www.seqan.de for more information and publications). In the first half of the
tutorial we cover the design principles of SeqAn. We show how generic it is and
which mechanisms guarantee high
performance. We discuss its content and which high-level applications have already
been implemented. In the second half of the tutorial we give a hands-on example of
how to rapidly prototype a high-level application in SeqAn.
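
To make the notion of a "basic algorithmic component" concrete, the following is a minimal sketch of one such component: a suffix array with exact string search. It deliberately uses only the C++ standard library and does not use SeqAn's API. The function names buildSuffixArray and findAll are hypothetical and chosen for illustration, and the construction is the naive comparison sort, not one of the efficient construction algorithms a production library would provide.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper (not SeqAn API): build a suffix array by sorting
// the start positions of all suffixes lexicographically.
// Naive construction, O(n^2 log n) in the worst case; illustration only.
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos,
                            text, b, std::string::npos) < 0;
    });
    return sa;
}

// Hypothetical helper (not SeqAn API): return all 0-based start positions
// of `pattern` in `text`, in increasing order, using two binary searches
// over the suffix array.
std::vector<std::size_t> findAll(const std::string& text,
                                 const std::vector<std::size_t>& sa,
                                 const std::string& pattern) {
    const std::size_t m = pattern.size();
    // First suffix whose length-m prefix is not smaller than the pattern.
    auto lo = std::lower_bound(
        sa.begin(), sa.end(), pattern,
        [&](std::size_t pos, const std::string& p) {
            return text.compare(pos, m, p) < 0;
        });
    // First suffix whose length-m prefix is greater than the pattern.
    auto hi = std::upper_bound(
        lo, sa.end(), pattern,
        [&](const std::string& p, std::size_t pos) {
            return text.compare(pos, m, p) > 0;
        });
    std::vector<std::size_t> hits(lo, hi);
    std::sort(hits.begin(), hits.end());  // report in text order
    return hits;
}
```

With the text "GATTACAGATTA", calling findAll with the pattern "ATTA" yields the positions 1 and 8. Even this small sketch has to get the comparison bounds right for suffixes shorter than the pattern, which hints at why hand-rolled, ad-hoc versions of such components are error-prone and why tested, generic implementations are valuable.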