Computing at Glasgow University
Paper ID: 7832
DCS Tech Report Number: TR-2004-165

The Top-Compressed Suffix Tree: A Disk-Resident Index for Large Sequences

Publication Type: Tech Report (internal)
Appeared in: DCS Tech Report
Page Numbers : 124
Publisher: Dept of Computing Science, University of Glasgow
Year: 2004

We introduce a new data structure, the Top-Compressed Suffix Tree, that can be used to provide a scalable, disk-resident index for large sequences. This data structure is based on the suffix tree, but has been designed to avoid the problems associated with using such structures on secondary memory. Top-Compressed Suffix Trees can be constructed incrementally, allowing indexes to be created that are larger than the amount of available main memory. Correspondingly, querying such an index only requires part of the data structure to be in main memory, thus we allow on-demand faulting and eviction of index sections during search. Such an index may be of great benefit to scientists requiring efficient access to vast repositories of genomic data. We present a detailed description of the data structure and relevant algorithms along with performance measurements. The construction algorithm used with our prototype includes a pre-processing stage that improves the performance of prefix-partitioned suffix tree construction when the size of the input data is similar to that of the available main memory. Our implementation of the Top-Compressed Suffix Tree has been used to provide compact, disk-resident indexes over biological sequences up to 1.5 Gbp in length, occupying an average of only 8.17 bytes per character indexed.

PS/PS.GZ PDF Bibtex entry Endnote XML