Paper ID: 7185
DCS Tech Report Number: TR-2003-135
The Suffix Sequoia Index for Approximate String Matching
Hunt,E.
Publication Type:
Tech Report (internal)
Appeared in:
DCS Tech Report
Page Numbers : 1-26
Publisher: Dept of Computing Science, University of Glasgow
Year: 2003
Abstract:
We address the problem of approximate string matching over
protein, DNA, and RNA strings, using an arbitrary cost matrix.
We focus on full-sensitivity searching, equivalent to the Smith-Waterman
algorithm.
The data structure we propose is compact. To index a text of length $n$, we use
just over $4n$ bytes. This data structure is amenable to further compression.
We show that the index scan of the approximate matching algorithm in which the
dynamic programming matrix is computed has time complexity
$O(mg^{d})$ for a pattern of length $m$, alphabet size $g$, and
indexing window size $d$.
We demonstrate the use of the suffix sequoia with the largest public repository
of proteins, now storing 400 million symbols. Our results indicate that
full-sensitivity approximate matching could be delivered at the speed comparable
to that of
heuristic matching.
We believe that this opens the way for a database solution to the
approximate sequence matching problem.
Keywords: string algorithms, database indexing, suffix tree, approximate matching, data structures
|