Next: References Up: Acclerating databases by compression Previous: Performance and conclusions

Storage Efficiency

In order to predict the space occupancy of a transformed memory resident database, it is necessary to have a mathematical storage model which provides apriori information about expected storage volumes. A particular feature of this system is its data sensitivity, as the performance of dictionary compression techniques relies heavily on contextual and repeated value redundancy in presented relations. The pre-processing of candidate relational data, even on a sample basis, should clarify the situation and give a database administrator a better idea of potential benefits provided by the system.

The rationale and nomenclature underlying the storage model is explained.

: Seperable Domain Bit Stream length
: number of columns in the ith table
: cardinality of the ith table
: cardinality of the set of values in the jth column of the ith table
: width of the representation of the jth column of the uncompressed table
: bit width of the representation of the jth column of the compressed table
: storage required for the jth column of the ith uncompressed table
: storage required for tokens in the jth column in the ith compressed table
: storage for the dictionary for the jth column of the ith compressed table
: storage for the indexing engine for the jth column of the ith compressed table
: , storage required for the ith table, corresponding to O, etc.
: measure of compression for jth column of the ith table
: measure of compression for the ith table

Now we can start to develop a predictive model of compression, first note that ,

given that dictionary structures may be shared between field domains, we will apply a kronicker delta flag for computational distinctions . i.e.

Also we will define a common domain cardinality , as follows.

where represents the individual cardinality of seperable common domains.
Thus we have, assuming hash table optimal slack being a factor half occupancy.

A measure of compression for a single column is

In terms of distinct types of data compression we observe that the first term represents the elimination of contextual redundancy through the use of minimal width tokens, whereas the composite second represents the suppression of repeated values within the column through the use of a dictionary.
The storage requirements for the ith table, including it's indexing are

The corresponding compression measure for the ith table is

The overall measure of compression is

Next: References Up: Acclerating databases by compression Previous: Performance and conclusions

W Paul Cockshott
Fri Sep 6 10:29:18 BST 1996