Multi-column hashing

Next: Rapid traversal of Up: Indexing Previous: Indexing

Multi-column hashing

The theory of single dimensional hashing, is well established and documented in the literature [Knuth,Fagan]. Its advantage over other techniques being that it provides an access time that is essentially independent of the size of the set being indexed. Multi-dimensional hashing extends this theory to a hyper-dimensional hash space, whose population density is dictated by a load factor. Any point in the search space is identifiable through the constructiuon of a multi-dimensional hash vector.

Table 3.1: Indexing of CHIPSPEC

Suppose that we wish to construct an index on the table shown originally in table 2.2. We chose to construct a composite index on the chipname and pin count fields of the relation. Given that we have a relation with 7 records we construct a hash table as shown in table 3.1, which points into the array of records. We form the hash codes by taking the tw0 least significant bits from the name field and one from the pin number field. Since the domains are compressed, the least significant bits provide a good hash function on the data. This gives us a range of eight possible hash codes. Linear probing is used to deal with clashes.

The hash codes can be considered either as three bit binary numbers, or, as a pair of hash indices covering the name and pin number dimensions, being two bits for the former and one for the latter. The advantage of the latter interpretation is that it allows one to scan along the dimensions of the hash code independently when performing partial match queries. Suppose we wish to find all 40 pin chips. Since the code for 40 pins is 00, the corresponding hash codes that will locate them are XX0, with X standing for undefined, i.e., 000, 010, 100, 110. Visiting these positions in the hash table, we find that records 0, and 5 are indicated. These are indeed the records of 40 pin chips. In addition, position 001 in the table has to be visisted, because of a clash on position 000. So a total of five locations are visited for two records returned.

The complexity of such a query will, for a given hash table loading, vary as where u is the number of unspecified bits in the hash code. It is clear that the best possible complexity measure for a query algorithm would be one whose computational cost was proportional to the cardinality of the set returned as an answer. If the fields of the relations were statistically independent of one another, the above algorithm would have the desired complexity measure, since would be proportional to the number of records matching the query.

In the above example the name field is a unique id for the records, and so the name field and pin number fields are not statistically independent. Thus, if a query is made for all chips whose name is ep900, then locations 0, 1, and 4 would be visited for only one record returned. These inefficiencies on primary key access obviously grow exponentially with the number of hash bits assigned to non-key fields. Since the sizes of the tables being indexed are not fixed, it it necessary to periodically add new bits to the hash code, and rehash, when the loading of the hash tables becomes excessive. This process is automatic, adding bits to the hash codes from fields in proportion to the bitlengths of their encoded representations. The selection of which fields are to be included in an index is done by human intervention. We intend in the future to incorporate the mechanisms described by Lloyd [Lloyd80] to optimise this.

Next: Rapid traversal of Up: Indexing Previous: Indexing

W Paul Cockshott
Fri Sep 6 10:29:18 BST 1996