Computing at Glasgow University
Paper ID: 7185
DCS Tech Report Number: TR-2003-135

The Suffix Sequoia Index for Approximate String Matching

Publication Type: Tech Report (internal)
Appeared in: DCS Tech Report
Page Numbers : 1-26
Publisher: Dept of Computing Science, University of Glasgow
Year: 2003

We address the problem of approximate string matching over protein, DNA, and RNA strings, using an arbitrary cost matrix. We focus on full-sensitivity searching, equivalent to the Smith-Waterman algorithm. The data structure we propose is compact. To index a text of length $n$, we use just over $4n$ bytes. This data structure is amenable to further compression. We show that the index scan of the approximate matching algorithm in which the dynamic programming matrix is computed has time complexity $O(mg^{d})$ for a pattern of length $m$, alphabet size $g$, and indexing window size $d$. We demonstrate the use of the suffix sequoia with the largest public repository of proteins, now storing 400 million symbols. Our results indicate that full-sensitivity approximate matching could be delivered at the speed comparable to that of heuristic matching. We believe that this opens the way for a database solution to the approximate sequence matching problem.

Keywords: string algorithms, database indexing, suffix tree, approximate matching, data structures

PDF Bibtex entry Endnote XML