<XML><RECORDS><RECORD><REFERENCE_TYPE>0</REFERENCE_TYPE><REFNUM>7765</REFNUM><AUTHORS><AUTHOR>Hunt,E.</AUTHOR></AUTHORS><YEAR>2004</YEAR><TITLE>Indexed Searching on Proteins Using a Suffix Sequoia</TITLE><PLACE_PUBLISHED>Bulletin of the IEEE Computer Society Technical Committee on Data Engineering </PLACE_PUBLISHED><PUBLISHER>IEEE Computer Society Press</PUBLISHER><PAGES>24-31</PAGES><LABEL>Hunt:2004:7765</LABEL><KEYWORDS><KEYWORD>suffix sequoia</KEYWORD></KEYWORDS<ABSTRACT>Approximate searching on protein sequence data under arbitrary cost models is not supported by database indexing technology. We present a new data structure, suffix sequoia, which reduces the time complexity of the dynamic programming (DP) matrix calculation required in approximate matching. The data structure is compact. It uses just over 4 Bytes per symbol indexed. We show that time complexity of the DP calculation is O(qg^{d}) for a pattern of length q, alphabet size $g$, and indexing window size d. The DP calculation requires no disk access, and can be executed efficiently. The second phase of the algorithm is based on sequential disk access, and appears to be effective. Approximate matching experiments are promising and offer a lot of scope for algorithm refinement and data structure engineering. </ABSTRACT></RECORD></RECORDS></XML>