Next: Difficulties in the Up: Accelerating databases by compression Previous: Accelerating databases by compression

Design Motivation

Most database management systems are constructed with the underlying assumption that the data to be manipulated will be largely stored in secondary storage. Main memory storage is used for intermediate products of queries or as a source of storage for only the most recently or frequently used data. The typical size of databases is increasing and applications which require hundreds of gigabytes are now beginning to emerge. The performance of disk-based database systems has always been a significant concern, particularly in the relational model where the performance of an application may depend on the performance of the optimiser rather than the skill of the database programmer. The continual increase in the size of typical database applications is matched by a progressive reduction in the cost of main memory. The increasing economic viability of main memory databases has prompted a considerable amount of research effort to be devoted to the investigation of main memory database systems (Garcia-Molina and Salem 1992).

A typical approach to data representation in such systems is to characterise domain values as a sequence of strings. Tuples are represented as a set of fixed length pointers to corresponding domain values (Pucheral et al 1990). Each domain is associated with an inverted list of object identifiers which identify the tuples containing the domain values. References can therefore be made in both directions between tuples and domains.
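The representation described above can be sketched as follows. This is an illustrative Python fragment, not code from any of the cited systems; the class and field names (`Domain`, `intern`, `inverted`) are hypothetical.

```python
class Domain:
    """A dictionary of domain values, each carrying an inverted
    list of the tuple identifiers that reference it."""

    def __init__(self):
        self.values = []    # value id -> string
        self.ids = {}       # string -> value id
        self.inverted = {}  # value id -> set of tuple ids

    def intern(self, value, tuple_id):
        # Assign a fixed-length identifier to the value on first sight.
        vid = self.ids.setdefault(value, len(self.values))
        if vid == len(self.values):
            self.values.append(value)
            self.inverted[vid] = set()
        self.inverted[vid].add(tuple_id)
        return vid


# A relation stores each tuple as a set of fixed-length identifiers
# into its domains, rather than as the strings themselves.
name, city = Domain(), Domain()
relation = []
for tid, (n, c) in enumerate([("smith", "glasgow"),
                              ("jones", "glasgow"),
                              ("smith", "london")]):
    relation.append((name.intern(n, tid), city.intern(c, tid)))

# References run in both directions: tuple -> value via the id,
# and value -> tuples via the inverted list.
assert name.values[relation[0][0]] == "smith"
assert city.inverted[city.ids["glasgow"]] == {0, 1}
```

Repeated values are stored once in the dictionary, so the compression gain grows with the repetitiveness of the domain.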

This approach provides significant compression but leaves each domain value represented by a fixed length pointer which would typically be a machine word in length. An alternative strategy is to store a fixed length reference number for each domain value instead of a pointer to the value (Goldstein and Strnad 1970). The benefit of this approach is that in static file structures, careful choice of the reference number will provide optimal packing of tuples. Goldstein and Strnad also identify the need to vary compression strategy to match relations which contain different types of data.

An optimal strategy for generating lexical tokens is described by Wang and Lavington (1989). This approach provides for the possibility of representing domain values in dynamic files by a series of minimal length reference numbers and gives the potential for the most efficient packing of compressed tuples.
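The packing gain from minimal length reference numbers can be illustrated with a simple sketch (again illustrative Python, not the Wang and Lavington scheme itself): a domain of cardinality n needs only ceil(log2 n) bits per reference, rather than a full machine word.

```python
import math

def pack(tokens, cardinality):
    """Pack reference numbers using the minimal bit width
    needed for a domain of the given cardinality."""
    width = max(1, math.ceil(math.log2(cardinality)))
    word = 0
    for t in tokens:
        word = (word << width) | t
    return word, width

def unpack(word, width, count):
    """Recover the reference numbers from a packed word."""
    out = []
    for _ in range(count):
        out.append(word & ((1 << width) - 1))
        word >>= width
    return list(reversed(out))

tokens = [5, 0, 3, 7]            # references into a domain of 8 values
packed, width = pack(tokens, 8)  # 3 bits each, not a 32-bit pointer
assert width == 3
assert unpack(packed, width, len(tokens)) == tokens
```

A tuple of four such references occupies 12 bits here, against 128 bits for four 32-bit pointers; keeping the width minimal as the domain grows is what makes the dynamic-file case harder.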

Estimates of the query cost of accessing database structures are mainly based on the assumption that tuple values will be stored in uncompressed format on secondary storage. Early work on partial match queries focussed on the use of single key attributes (Rivest 1976) and was subsequently developed to allow for dynamic file structures (Fagin et al 1979).

Strategies have been proposed which will optimise data access on the basis of multiple independent attributes specified in a query (Aho and Ullman 1979). Such functions allow for partial specification of secondary keys. The independence characteristic requires the assumption that all possible queries are equally likely; in any realistic system, however, many of the potential queries will never be executed. This leads to different patterns for optimising multidimensional queries (Lloyd 1980).

This article describes an experimental database system developed at the University of Strathclyde which aims to achieve enhanced performance by the use of a combination of data compression and persistent programming techniques. Experience with the system has indicated that very significant gains in database performance and significant reductions in data volume become possible using these techniques.

Orthogonal persistence [Atkinson83][Cockshott83] is a technique by which a program sees a uniform view of data irrespective of its lifetime. Thus any data structure that a programmer can declare in a programming language can be made to persist over several program invocations without the programmer having to make any special effort to ensure this. Reviews of implementation techniques for persistent stores are provided in [Cockshott85] and [Suzuki94]. Its attraction from the standpoint of database implementation is that, by concentrating the data being worked on in fast random access store, it offers the promise of considerable performance gains when compared to orthodox designs which are centered on disk files. If one assumes that one's data is basically stored in RAM not disk, the cost of following a pointer becomes modest. A consequence is that it becomes possible to use more sophisticated and complex data structures without having to pay an excessive performance penalty. In particular, the use of data compression becomes a more attractive option. This is both because a compressed database will have a smaller working set and thus perform better with a given RAM size, and also because the costs of indirection, which are implied by the use of dictionaries to tokenise the database, now become supportable.
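The working-set argument can be made concrete with a rough back-of-envelope sketch (illustrative figures only, not measurements from the system described here): tokenising a repetitive column shrinks the resident data, at the cost of one extra indirection per access.

```python
# A column of 1000 repetitive string values.
column = ["glasgow"] * 900 + ["london"] * 100

# Uncompressed: every occurrence stores the full string.
raw_bytes = sum(len(s) for s in column)

# Tokenised: one dictionary entry per distinct value, plus a
# small token (assume 1 byte here) per occurrence.
dictionary = sorted(set(column))
token_bytes = sum(len(s) for s in dictionary) + len(column) * 1

assert raw_bytes == 900 * 7 + 100 * 6   # 6900 bytes
assert token_bytes == 13 + 1000          # 1013 bytes
assert token_bytes < raw_bytes           # smaller working set

# The cost of the indirection is one dictionary lookup per access.
tokens = {s: i for i, s in enumerate(dictionary)}
tokenised_column = [tokens[s] for s in column]
assert dictionary[tokenised_column[0]] == "glasgow"
```

With the data RAM-resident, that lookup is a single index operation, which is the sense in which the indirection cost becomes supportable.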


W Paul Cockshott
Fri Sep 6 10:29:18 BST 1996