<XML><RECORDS><RECORD><REFERENCE_TYPE>3</REFERENCE_TYPE><REFNUM>7831</REFNUM><AUTHORS><AUTHOR>Japp,R.P.</AUTHOR></AUTHORS><YEAR>2004</YEAR><TITLE>The Top-Compressed Suffix Tree: A Disk-Resident Index for Large Sequences</TITLE><PLACE_PUBLISHED>21st Annual British National Conference On Databases, Volume 2</PLACE_PUBLISHED><PUBLISHER>N/A</PUBLISHER><PAGES>68--79</PAGES><ISBN>1-904410-12-X</ISBN><LABEL>Japp:2004:7831</LABEL><KEYWORDS><KEYWORD>Suffix Trees</KEYWORD></KEYWORDS><ABSTRACT>We present a novel data structure, the Top-Compressed Suffix Tree, that can be used to provide a scalable, disk-resident index over large biological sequences. This data structure has been designed to overcome a number of the limitations commonly associated with using suffix trees over large sequences. In particular, the Top-Compressed Suffix Tree supports incremental construction and on-demand faulting, allowing the use of indexes with a total size that greatly exceeds the amount of available main-memory, an important property if we are to provide indexes over complete mammalian genomes. Additionally, we improve upon the prefix-partitioned suffix tree construction algorithm [4] by introducing a pre-processing stage and by demonstrating parallel index construction. The Top-Compressed Suffix Tree implementation discussed here has been used to provide a disk-resident index over DNA sequences of up to 1.5 Gbp in length, occupying an average of only 8.17 bytes per character indexed.</ABSTRACT></RECORD></RECORDS></XML>