<XML><RECORDS><RECORD><REFERENCE_TYPE>10</REFERENCE_TYPE><REFNUM>7185</REFNUM><AUTHORS><AUTHOR>Hunt,E.</AUTHOR></AUTHORS><YEAR>2003</YEAR><TITLE>The Suffix Sequoia Index for Approximate String Matching</TITLE><PLACE_PUBLISHED>DCS Tech Report</PLACE_PUBLISHED><PUBLISHER>Dept of Computing Science, University of Glasgow</PUBLISHER><PAGES>1-26</PAGES><ISBN>TR-2003-135</ISBN><LABEL>Hunt:2003:7185</LABEL><KEYWORDS><KEYWORD>string algorithms</KEYWORD></KEYWORDS<ABSTRACT>We address the problem of approximate string matching over protein, DNA, and RNA strings, using an arbitrary cost matrix. We focus on full-sensitivity searching, equivalent to the Smith-Waterman algorithm. The data structure we propose is compact. To index a text of length $n$, we use just over $4n$ bytes. This data structure is amenable to further compression. We show that the index scan of the approximate matching algorithm in which the dynamic programming matrix is computed has time complexity $O(mg^{d})$ for a pattern of length $m$, alphabet size $g$, and indexing window size $d$. We demonstrate the use of the suffix sequoia with the largest public repository of proteins, now storing 400 million symbols. Our results indicate that full-sensitivity approximate matching could be delivered at the speed comparable to that of heuristic matching. We believe that this opens the way for a database solution to the approximate sequence matching problem. </ABSTRACT></RECORD></RECORDS></XML>