Algorithm for the fast construction of codebooks for hierarchical vector quantization
Paul Cockshott, Robert Lambert
Tuesday, 27 January 1998
We present a recursive algorithm designed to construct quantization tables and codebooks for the hierarchical vector quantization of images. The algorithm is computationally inexpensive and yields high quality codebooks.
Vector Quantization (VQ) has been found to be an efficient technique for image compression . Following Shannon’s rate-distortion theory, a better compression is achievable by coding vectors rather than scalars. VQ represents a mapping from a k-dimensional space Rk to a finite subset Y of Rk. This finite set Y is called a codebook. To code a data sequence, an encoder assigns to each data vector x in the set Rk an index corresponding to a vector in Y, that in turn is mapped to a code-word y in the set Y by a decoder.
In its simplest implementation for image compression, VQ requires that an image is divided into fixed sized non-overlapping sub-image blocks, each represented by a data vector x in the set Rk. The pixels in the sub-image block correspond to the elements in the data vector x. Each image data vector is compared with the entries of a suitable codebook Y and the index i of the codebook entry yi most similar to the source data vector is transmitted to a receiver. At the receiver, the index accesses the corresponding entry from an identical codebook, allowing reconstruction of an approximation to the original image.
While simple, this implementation can only compress at a single compression ratio determined by the size of the block used to partition the image and the number of entries in the codebook. Changing either the block size or the codebook size will allow the compression ratio to vary, but involves the training and storage of many codebooks to take account of all block and codebook sizes used if the compression ratio is to be a dynamic variable.
An alternative is to use several distinct block sizes to encode an image with a codebook assigned to each block size. Variable dimension vector quantization operates on the principle that large blocks should be used to encode areas of low detail, while small blocks should encode areas of high detail. It represents a technique whereby an image can be compressed to a pre-determined compression ratio by defining, for instance, the number of blocks overall that may be used in the encoding of the image.
Ideally a single universal codebook, appropriate to the block size adopted and available to both compressor and decompressor, should be used. Increasing the size of the blocks can in principle raise degree of compression. However, to maintain quality the size of the codebook must be increased as the block size is increased. This limits the practicality of codebooks for large block sizes given the memory required for storage of the code words and the time required to identify the optimal match.
Memory requirements and search time can be traded for compression complexity and compression quality by adopting a coding scheme that takes advantage of correlation in the image data. For example, setting the mean intensity of each coding block to zero before it is vector-quantized can reduce the size of the codebook. Normalising block variance (or contrast) can achieve a further reduction in codebook size in addition to standardising the mean intensity.
To keep the size, search time and memory required by the codebooks low, the codebook vectors are normalised with respect to contrast and mean intensity. A block approximation by a codebook is thus a triple consisting of the codebook index, a contrast adjustment and brightness offset. The quantization of the contrast and means intensity values to integers; the size and content of the codebook and the overhead of encoding block position determine the quality of the compression.
The basic problem is to search a codebook, containing for example 256 small archetypal 4 x 4 pixel images and find which of these most closely corresponds to a given 4 x 4 pixel area in the picture being compressed.
For the sake of simplicity we will restrict our initial discussion to the problem of searching codebooks whose patches are 4 pixels by 4 pixels in size. If this problem can be solved, one can readily handle patches of larger sizes. One simply constructs codebooks with larger patches that are appropriately scaled up versions of the 4 x 4 patches. Subsequently, when looking for a match, you subsample the area of the image one is trying to match, match it against the appropriate 4 x 4 patch and use the corresponding larger patch when reconstructing the image.
If we perform a linear scan of the codebook (Fig 1), the complexity of the search operation will be such that it involves 256 x 4 x 4 basic comparison operations, or, more generally takes a time proportional to the number of codebook entries. What we require is a lower complexity operation, analogous for example to hashing, whose time order is a small constant independent of the number of codebook entries. This could be achieved were we able to replace the linear search with an appropriate table lookup operation.
Figure 1 Linear scanning, this is conceptually the easiest way to find a matching entry from the codebook
A 4 x 4 pixel patch can be considered as a 16-element vector. The problem of finding the best entry in a codebook is then one of determining which of n points in a 16 dimensional metric space is closest to a given sample point (sample patch), n being the number of entries in the codebook.
Given sufficient memory, one could in principle, construct a vast lookup table, a 16 dimensional array that used the original 16 co-ordinates specifying a patch as array indices to locate an index number for the appropriate codebook entry. This would require an array with 2 128 entries, which is technically infeasible.
Hierarchical vector quantization is a technique used in image and sound compression for using table lookup operations to find the nearest entry B[i] from a codebook B of vectors to a source vector v. We can illustrate how it works with two examples.
1. Suppose that we are dealing with two-dimensional vectors, whose members are byte-integers. Suppose that our codebook B0 contains 256 entries or less. We construct a two-dimensional array of byte-integers T0 such that B0[T0[i,j]] is the closest entry in B0 to the vector [i, j ]. Finding the closest entry in the codebook to a given input vector can then be determined simply by looking up the table T0.
2. Suppose that we are dealing with 4 dimensional vectors [i, j, k, l ], again expressed as byte-integers. Using table T0 we can map these into two-dimensional vectors [T0[i,j], T0[k,l]]. Suppose that our codebook B1 contains 256 entries or less, each of which is a 4 dimensional vector [p, q, r, s]. We construct a two-dimensional array of byte-integers T1 such that B1[T1[T0[i,j],T0[k,l]]] is the closest entry in B1 to the vector [B0[T0[i,j]],B0[T0[k,l]]]. Finding the closest entry in the code-book to an given input vector can then be determined by looking up the table T0, on adjacent pairs of vector elements, and then applying T1 to the resultant pair.
It is clear that the method can be applied recursively to match vectors of any power of two dimensions with entries in a codebook. The complexity of the operation is proportional to the number of dimensions of the vectors being compared and independent of the size of the codebooks.
Figure 2 Codebook lookup could in principle be reduced to an array indexing operation. It would just require a rather large array.
The great advantage of array lookup operations is that they can be very fast, and although looking up an array of 2128 entries is not possible, looking up smaller arrays is. If one were to use a hierarchy of lookup operations, cascading lookups on a series of 2 dimensional arrays each holding 216 entries, then the entire process could be performed in 15 lookup operations as shown in figure 3. This allows very fast video compression to be performed. We find that we can compress CIF sized video sequences at 12.5 frames per second on a 300Mhz Pentium.
Whilst this shows that it is in principle possible to reduce the codebook search to a series of lookup operations, it leaves us with the not inconsiderable problem of deriving a set of tables to use in the lookup process.
If we look at figure 3, it is clear that the lookup process is essentially recursive. Pairs of 8 bit values are combined to form 16 bit table indices, resulting in more 8-bit values, which are themselves combined to word length indices, etc. This gives us a good hint that the process of constructing the tables is itself likely to be a recursive process.
At each level of the recursion during lookup we are reducing two dimensions to one dimension, two byte indices to a single byte output. Let us for a moment allow ourselves to assume that what we are dealing with at each input level are two-dimensional metric spaces. Since the overall aim of the whole lookup process is to find which 16 dimensional vector from our codebook is nearest to a given input vector, let us assume that this overall aim can be reflected at the two dimensional level. In that case the problem of constructing the individual lookup tables can be considered that of tabulating for each point in the two dimensional space the index of the output vector that is closest to it.
Consider the first level of tables. These take as indices two pixels. The pixels can be considered as orthogonal axes in a two dimensional space. We want to generate as output an 8 bit index number that identifies one of 256 vectors, each of two dimensions and for the two dimensional vector so identified to be the one that is most similar to the pair of input pixels. We view the process as having a codebook of 256 pixel pairs, and for all possible pairs of 8 bit pixels determining which codebook entry best approximates it.
Figure 3 How 15 table lookup operations can map 16 pixels to a codebook entry.
Figure 4 Approximation of pairs with archetypes
If we use the brightness values of a pair of adjacent pixels to define the x and y axes of a graph, then the putative 2-pixel codebook consists of a scatter of points that are to serve as archetypes for sample points. Around each archetypal point there will be a domain of attraction within which all sample points are mapped to the archetype. The domain of attraction of an archetype a consists of all points p such that there exists no archetype b for which d(b, p)<d(a, p), for some metric function d( ). We would expect these domains to polygonal. This is illustrated in Figure 4.
There is an obvious weakness with the scheme outlined above - we do not start out with a set of archetypes at the 2-pixel level. We must somehow derive these small-scale archetypes.
There is also a subtler problem. The cascaded table lookup process shown in figure 3 is essentially recursive . The mapping tables shown in figure 3, actually implement the domains of attraction illustrated in figure 3 by establishing a mapping between points in the plane and their attractors. This implies that there must be analogous diagrams to figure 4 that can be constructed to represent the level 2, level 3 etc mapping tables. However when we come to consider the level 2 mapping table, the pair of numbers that will be fed into it are no longer pixels, but codebook index numbers for a 2-pixel codebook.
When it is a question of looking up a table, this is of no import, from the standpoint of the computer’s indexing hardware, a number is a number irrespective of what it represents. To address an array one simply needs a collection of bits. But when it is a question of constructing a map of the domains of attraction like figure 8, it does matter what sort of numbers are used to construct the axes. Using bytes to determine co-ordinate positions according to some axes implies that the set of possible bit patterns within the bytes are fully ordered. It implies that the pattern 1000001 is adjacent to 10000010, etc. This is a prerequisite for the construction of domains of attraction as per figure 5. Such a domain of attraction groups neighbouring points, but the notion of neigbouring is only defined if the set of co-ordinate pairs constitute a metric space. This implies, in its turn, that the axes themselves must be metric spaces - one-dimensional in this case. Thus the codes that we use to label archetypes must be metric spaces.
Let T be a mapping table as shown in figure 3. Let a, b, g, d be points in a two-dimensional space as shown in figure 4. We want neighbouring points a, b, to have codes T[a], T[b] that are themselves close together. We want points in the plane g, d that are far apart to have codes T[g]. T[d] that are themselves far apart. That is to say, we want |T[a] - T[b]| to be positively correlated with d(a,b) for all a,b in the plane, and for some appropriate metric d.
We have deduced that the construction of the mapping tables must meet the following constraints:
the construction process must be recursive
at each level it must tabulate basins of attraction
at each level it must derive the attractors of the basins of attraction
it must assign index numbers to the attractors such that distances in index number space are strongly correlated with distances in 2-D space.
To use it effectively we need to construct a series of tables T0, 1, ..., m and codebooks B0,1,...,m where m is log2 of the number of pixels in the largest patch that we wish to use for compression.
The existing approach is to develop the codebooks first and then derive the tables from them. Given a codebook Bm-1 and a codebook of Bm of the next higher level, the index entry for Tm[i,j] is chosen such that Bm[Tm[i,j]] will be the closest approximation to [Bm-1[i],Bm-1[j]] in the codebook at level m.. Since in this approach the construction of the codebook is disconnected from the construction of the index tables there is no guarantee that the final codebook will be efficiently used. One might for instance find that for instance the 43rd entry in the higher level codebook was never selected as a good approximation for any pair of lower level codebook entries. It can thus happen that a significant portion of ones higher level codebook is never used. The method that we present below is aimed to avoid this inefficient use of the codebook. It attempts to construct tables and codebooks which maximise the signal to noise ratio of the reconstructed image and which ensures that all codebook entries will be used on ‘typical’ image sequences. Typical sequences are here understood to mean sequences with similar statistical properties to those of the training set.
The Construction algorithm
Our algorithm will construct a series of mapping tables T1, T2,... such that T1ij specifies the attractor index to be output at level 1 when given a pair of horizontally adjacent pixels i,j. At the next level of the hierarchy, T2 will map vertically adjacent pairs of T1-mapped horizontally adjacent pairs of pixels. The process is shown schematically in figure 5, where the pixel square
is shown being passed through 4 levels of mapping to give a single output from the last table.
Figure 5 Hierarchy of table lookup operations used in the compression process, showing how horizontal and vertical groupings alternate at different levels.
The first phase in the construction of a table at any given level in the hierarchy is to construct a co-occurrence matrix F. This is a square matrix of rank 256 which tabulates how often pairs of inputs occur. Hence Fij will be the frequency with which inputs i,j occur in some sample set. We will term the first level frequency table F1, the second F2 etc.
The frequency tables F follow a hierarchy exactly corresponding to that of the mapping tables T. Whilst we are constructing tables at level 1 in the hierarchy, these ‘inputs’ are horizontally adjacent pixels. Whilst we are constructing tables at level 2 in the hierarchy the inputs i, j are derived from the outputs of applying T1 mappings to the level below etc. It follows that we can only construct the frequency matrix for level n after the mapping table for level n-1 has been built. The frequency tables are built up by scanning a series of images and taking sample patches from all possible patch origins.
An example of such a co-occurrence matrix is shown in Figure 6. It will be observed that at level 1 there is a strong tendency for occurrences to be clustered along the 45° diagonal. This expresses the fact that adjacent pixels are positively correlated. Within the diagonal cluster, there is a further clustering towards the mean brightness.
Figure 6 A false colour representation of a frequency co-occurrence map of adjacent pixel values
Given the co-occurrence matrix, the algorithm proceeds as follows:
The effect of this partitioning algorithm is illustrated in figure 7.
Each split represents a bit of information about the vector. We wish these bits to be used in a way that will minimise some error metric, in our case Mean Squared Error(MSE). Just as the mean of a Probability Density Function(PDF) is the single estimator of the distribution which minimises expected MSE. If a PDF is split through its mean, the means of the resulting two partitions constitute the pair of estimators which minimise the MSE. The selection of the axis with greatest variance as the one to split, ensures that the split contributes the greatest amount to the minimisation of estimation errors.
Figure 7 An orthogonal-axis recursive partitioning of a frequency map, with resulting codes
The sequence in which these splits were performed can now be used to assign code numbers to the resultant rectangles. The first split is used to determine the most significant bit of the code number, the second split the next most significant etc. The bit is set to one for the right hand side rectangle of a horizontal split or the upper rectangle of a vertical split. You will observe, that data points that are widely separated in 2-space will tend to have codes that differ in their more significant bits. We thus arrive at codes that both preserve locality, and recursively ensures that the highest code is assigned to the top right and the lowest code to the bottom left rectangle.
Alongside the partitions of the co-occurrence matrix F, a parallel partitioning is performed on the mapping tables T . Each time F is subdivided, an additional bit is set or cleared for all entries in the corresponding partition of T . Thus the first partition to be carried out determines the most significant bit of all entries. The second level of recursion determines the second most significant bit etc. This process is illustrated in Figure 7. In each case, the partition whose elements have higher values in terms of the axis of greater variance has its bits set to 1 and the other has its bits set to 0.
After eight levels of recursion all of the bits of all the bytes in the table T will have been determined.
It should be noted that the training algorithm is fast. Segmentation of a frequency table with 216 entries takes about 10 seconds on a 75Mhz Pentium. For experimental work, this is a considerable advantage over other codebook training algorithms.
Forming the codebooks
What we have described so far constitutes a categorisation system for groups of 16 pixels, or, if the algorithm is extended further, for groups of 64 etc pixels. Given such categories, we have to derive the codebooks that correspond to it.
Once we have an appropriately partitioned frequency map, this can be used to form a codebook of the relevant scale. For the codebooks with patch size 4x4 the procedure is straightforward. We scan an image and use the HVQ tables to assign code numbers to sample patches within the image. For each code number we average all of the sample patches assigned to it, and the resulting set of intensity values forms the codebook entry.
Figure 8 a 4x4 codebook with 256 entries generated by the algorithm.
It should be emphasised that both when training and when forming the codebook, we use normalisation of the training data. That is, if we are building a set of HVQ tables up to patch size 4 x 4, we take 4 x 4 tiles, normalise their contrast and brightness. During encoding we will take 4 x 4 samples and normalise them prior to using the tables for indexing, thus the tables must be trained under the same conditions. Figure 8 shows a codebook of 4 by 4 patches formed in this way. Patches are allocated code numbers starting with 0 in the top left and 255 in the bottom right in column major order. Observe that patches that are vertically adjacent to one another, and thus differ by 1 in code, are similar in appearance.
Figure 9 A 16x16 pixel codebook corresponding to that in Figure 8.
When using the HVQ tables to compress an image patch whose size is greater than 4x4 we first shrink the patch to size 4x4, averaging source pixels as we go, normalise the resulting patch and then use the tables to determine the codebook index of the required patch. A consequence of this is that the large scale codebook entries should look like enlarged versions of the smaller ones. This is illustrated in Figure 9. We have experimented with two methods for constructing codebooks for patches of sizes 8x8, 16x16 etc.
Figure 10 A portion of a 32x32 codebook formed by source patch averaging.
The first of these approaches is preferred as it leads to smoother codebook patches with less high frequency noise. Using the second approach one gets a speckled effect on the largest patches, unless the training set is very large.
Figure 11. Examples of compression using the HVQ codebooks at different compression settings. Note that on the web there is a degradation of quality of these images due to recompression using JPEG and then rendering using 256 colours.
Figures 11a to 11c show the effect of compressing an image at 0.5 bits per pixel and at 0.25 bits per pixel. The original image is 192 pixels square and is encoded as a YUV 4:2:0 video frame. The decompressor incorporates a block removal filter. The codebook used incorporated 512 entries and had been trained on a sequence of 16 video frames of size 192 pixels square. The sequence of frames used for training did not include the sample image.
Figures 11d and 11e show the effect of holding the bits per frame constant whilst varying the codebook size. All codebooks were trained on the same training set. In general it has been found that increasing the size of the codebook used improves the quality of picture produced. Over large numbers of frames, a quadrupling of the codebook size produces an improvement of about a third of a decibel in the PSNR.