Next: Query compression Up: Compression Previous: Compression

First order compression

The techniques used in HPDS to implement compressed relations are fairly complex, so it is worth approaching them by a series of steps, starting with a `virtual' picture of how the compression works. Consider the simple relational database shown in table 2.1. It has a single relation CHIPSPEC that describes the characteristics of a number of chips.

Table 2.1: CHIPSPEC Database Contents

If this is stored as a file of records, we have to make each field of these records big enough to hold the largest value that any of the records stores in that field. Since, moreover, the database designer is not likely to know exactly how large the individual values may be, the tendency will be to err on the side of caution and make the fields longer than is strictly necessary. In this instance a designer might specify the following field widths in bytes.

With each tuple occupying 20 bytes the space occupied by the relation would be 140 bytes.

The basic compression technique used in HPDS involves reducing each field to an integer containing just sufficient bits to encode all the values that occur within the domain of that field in the database. Since there are seven distinct part names, the parts column of the relation could be represented as a list of 3 bit numbers. Similarly, since there are only 3 distinct values for the number of pins, these could be represented as 2 bit numbers.

Once similar compression has been applied to the other fields, the resulting tuples are found to be only 7 bits long. Using a character format of data without compression, the table itself occupies 20 bytes tuples = 1120 bits. Now it occupies only 49 bits.

The compressed form holds the essence of the relationships, abstracting from the particular entities between which these relationships hold.

To translate these abstract relationships into concrete ones, it is necessary to go through a dictionary.

In their simplest form, these dictionaries can be simple lists of the values that occur in the domain. One would expect that the dictionaries would be smaller than the sum of the corresponding entries in the original table, since:

The original table would have contained fixed width fields. For a name, for example, this would have to be long enough to hold the longest name expected. Since most names would be shorter, a more compact representation could be used in the dictionary.
In all columns but the simple primary key, there is likely to be repetition of field values. In our example, the strings `CMOS' and `TTL' occur repeatedly. Only one instance of these has to be stored in the dictionary.

In our previous example the compression might be as shown in table 2.2.

Table 2.2: Compression of CHIPSPEC

We assume that the compressed code for a field is given by its position in the corresponding dictionary. The dictionaries are shown in Table 2.3.

Table 2.3: Dictionaries for CHIPSPEC

Next: Query compression Up: Compression Previous: Compression

W Paul Cockshott
Fri Sep 6 10:29:18 BST 1996