Algorithm for the fast construction of codebooks for hierarchical vector quantization

Paul Cockshott, Robert Lambert

**Tuesday, 27 January 1998**

Abstract

We present a recursive algorithm designed to construct quantization tables and codebooks for the hierarchical vector quantization of images. The algorithm is computationally inexpensive and yields high quality codebooks.

Introduction

Vector Quantization (VQ) has been found to be
an efficient technique for image compression . Following Shannon’s rate-distortion
theory, a better compression is achievable by coding vectors rather than
scalars. VQ represents a mapping from a *k*-dimensional space *R ^{k}*
to a finite subset Y of

In its simplest implementation for image compression,
VQ requires that an image is divided into fixed sized non-overlapping sub-image
blocks, each represented by a data vector *x *in the set *R ^{k}*.
The pixels in the sub-image block correspond to the elements in the data
vector

While simple, this implementation can only compress at a single compression ratio determined by the size of the block used to partition the image and the number of entries in the codebook. Changing either the block size or the codebook size will allow the compression ratio to vary, but involves the training and storage of many codebooks to take account of all block and codebook sizes used if the compression ratio is to be a dynamic variable.

An alternative is to use several distinct block sizes to encode an image with a codebook assigned to each block size. Variable dimension vector quantization operates on the principle that large blocks should be used to encode areas of low detail, while small blocks should encode areas of high detail. It represents a technique whereby an image can be compressed to a pre-determined compression ratio by defining, for instance, the number of blocks overall that may be used in the encoding of the image.

Ideally a single universal codebook, appropriate to the block size adopted and available to both compressor and decompressor, should be used. Increasing the size of the blocks can in principle raise degree of compression. However, to maintain quality the size of the codebook must be increased as the block size is increased. This limits the practicality of codebooks for large block sizes given the memory required for storage of the code words and the time required to identify the optimal match.

Memory requirements and search time can be traded for compression complexity and compression quality by adopting a coding scheme that takes advantage of correlation in the image data. For example, setting the mean intensity of each coding block to zero before it is vector-quantized can reduce the size of the codebook. Normalising block variance (or contrast) can achieve a further reduction in codebook size in addition to standardising the mean intensity.

To keep the size, search time and memory required by the codebooks low, the codebook vectors are normalised with respect to contrast and mean intensity. A block approximation by a codebook is thus a triple consisting of the codebook index, a contrast adjustment and brightness offset. The quantization of the contrast and means intensity values to integers; the size and content of the codebook and the overhead of encoding block position determine the quality of the compression.

Codebook searching

The basic problem is to search a codebook, containing for example 256 small archetypal 4 x 4 pixel images and find which of these most closely corresponds to a given 4 x 4 pixel area in the picture being compressed.

For the sake of simplicity we will restrict our initial discussion to the problem of searching codebooks whose patches are 4 pixels by 4 pixels in size. If this problem can be solved, one can readily handle patches of larger sizes. One simply constructs codebooks with larger patches that are appropriately scaled up versions of the 4 x 4 patches. Subsequently, when looking for a match, you subsample the area of the image one is trying to match, match it against the appropriate 4 x 4 patch and use the corresponding larger patch when reconstructing the image.

If we perform a linear scan of the codebook (Fig 1), the complexity of the search operation will be such that it involves 256 x 4 x 4 basic comparison operations, or, more generally takes a time proportional to the number of codebook entries. What we require is a lower complexity operation, analogous for example to hashing, whose time order is a small constant independent of the number of codebook entries. This could be achieved were we able to replace the linear search with an appropriate table lookup operation.

Figure 1 Linear scanning, this is conceptually the easiest way to find a matching entry from the codebook

A 4 x 4 pixel patch can be considered as a 16-element
vector. The problem of finding the best entry in a codebook is then one
of determining which of *n* points in a 16 dimensional metric space
is closest to a given sample point (sample patch), *n* being the number
of entries in the codebook.

Given sufficient memory, one could in principle,
construct a vast lookup table, a 16 dimensional array that used the original
16 co-ordinates specifying a patch as array indices to locate an index
number for the appropriate codebook entry. This would require an array
with 2 ^{128} entries, which is technically
infeasible.

Hierarchical vector quantization is a technique
used in image and sound compression for using table lookup operations to
find the nearest entry B[*i*] from a codebook **B** of vectors
to a source vector *v*. We can illustrate how it works with two examples.

1. Suppose that we are dealing with two-dimensional
vectors, whose members are byte-integers. Suppose that our codebook **B**0
contains 256 entries or less. We construct a two-dimensional array of byte-integers
**T**0 such that **B**0[**T**0[*i,j*]]
is the closest entry in **B**0 to the vector [*i,
j *]. Finding the closest entry in the codebook to a given input vector
can then be determined simply by looking up the table **T**0.

2. Suppose that we are dealing with 4 dimensional
vectors [*i, j, k, l* ], again expressed as byte-integers. Using table
**T**0 we can map these into two-dimensional vectors
[**T**0[*i,j*], **T**0[*k,l*]].
Suppose that our codebook **B**1 contains 256 entries
or less, each of which is a 4 dimensional vector [*p, q, r, s*]. We
construct a two-dimensional array of byte-integers **T**1
such that **B1**[**T1****[T0**[*i,j*],**T**0[*k,l*]]]
is the closest entry in **B**1 to the vector [**B**0[**T**0[*i,j*]]*,***B**0[**T**0[*k,l*]]].
Finding the closest entry in the code-book to an given input vector can
then be determined by looking up the table **T**0,
on adjacent pairs of vector elements, and then applying **T**1
to the resultant pair.

It is clear that the method can be applied recursively to match vectors of any power of two dimensions with entries in a codebook. The complexity of the operation is proportional to the number of dimensions of the vectors being compared and independent of the size of the codebooks.

Figure 2 Codebook lookup could in principle be reduced to an array indexing operation. It would just require a rather large array.

The great advantage of array lookup operations
is that they can be very fast, and although looking up an array of 2^{128}
entries is not possible, looking up smaller arrays is. If one were to use
a hierarchy of lookup operations, cascading lookups on a series of 2 dimensional
arrays each holding 2^{16} entries, then the
entire process could be performed in 15 lookup operations as shown in figure
3. This allows very fast video compression to be performed. We find that
we can compress CIF sized video sequences at 12.5 frames per second on
a 300Mhz Pentium.

Whilst this shows that it is in principle possible to reduce the codebook search to a series of lookup operations, it leaves us with the not inconsiderable problem of deriving a set of tables to use in the lookup process.

Recursive decomposition

If we look at figure 3, it is clear that the lookup process is essentially recursive. Pairs of 8 bit values are combined to form 16 bit table indices, resulting in more 8-bit values, which are themselves combined to word length indices, etc. This gives us a good hint that the process of constructing the tables is itself likely to be a recursive process.

At each level of the recursion during lookup we are reducing two dimensions to one dimension, two byte indices to a single byte output. Let us for a moment allow ourselves to assume that what we are dealing with at each input level are two-dimensional metric spaces. Since the overall aim of the whole lookup process is to find which 16 dimensional vector from our codebook is nearest to a given input vector, let us assume that this overall aim can be reflected at the two dimensional level. In that case the problem of constructing the individual lookup tables can be considered that of tabulating for each point in the two dimensional space the index of the output vector that is closest to it.

Consider the first level of tables. These take as indices two pixels. The pixels can be considered as orthogonal axes in a two dimensional space. We want to generate as output an 8 bit index number that identifies one of 256 vectors, each of two dimensions and for the two dimensional vector so identified to be the one that is most similar to the pair of input pixels. We view the process as having a codebook of 256 pixel pairs, and for all possible pairs of 8 bit pixels determining which codebook entry best approximates it.

Figure 3 How 15 table lookup operations can map 16 pixels to a codebook entry.

Figure 4 Approximation of pairs with archetypes

If we use the brightness values of a pair of adjacent pixels to define the x and y axes of a graph, then the putative 2-pixel codebook consists of a scatter of points that are to serve as archetypes for sample points. Around each archetypal point there will be a domain of attraction within which all sample points are mapped to the archetype. The domain of attraction of an archetype a consists of all points p such that there exists no archetype b for which d(b, p)<d(a, p), for some metric function d( ). We would expect these domains to polygonal. This is illustrated in Figure 4.

There is an obvious weakness with the scheme outlined above - we do not start out with a set of archetypes at the 2-pixel level. We must somehow derive these small-scale archetypes.

There is also a subtler problem. The cascaded table lookup process shown in figure 3 is essentially recursive . The mapping tables shown in figure 3, actually implement the domains of attraction illustrated in figure 3 by establishing a mapping between points in the plane and their attractors. This implies that there must be analogous diagrams to figure 4 that can be constructed to represent the level 2, level 3 etc mapping tables. However when we come to consider the level 2 mapping table, the pair of numbers that will be fed into it are no longer pixels, but codebook index numbers for a 2-pixel codebook.

When it is a question of looking up a table, this is of no import, from the standpoint of the computer’s indexing hardware, a number is a number irrespective of what it represents. To address an array one simply needs a collection of bits. But when it is a question of constructing a map of the domains of attraction like figure 8, it does matter what sort of numbers are used to construct the axes. Using bytes to determine co-ordinate positions according to some axes implies that the set of possible bit patterns within the bytes are fully ordered. It implies that the pattern 1000001 is adjacent to 10000010, etc. This is a prerequisite for the construction of domains of attraction as per figure 5. Such a domain of attraction groups neighbouring points, but the notion of neigbouring is only defined if the set of co-ordinate pairs constitute a metric space. This implies, in its turn, that the axes themselves must be metric spaces - one-dimensional in this case. Thus the codes that we use to label archetypes must be metric spaces.

Let **T** be a mapping table as shown in figure
3. Let a, b, g, d
be points in a two-dimensional space as shown in figure 4. We want neighbouring
points a, b, to
have codes **T**[a],
**T**[b] that
are themselves close together. We want points in the plane g,
d that are far apart to have codes **T**[g].
**T**[d] that
are themselves far apart. That is to say, we want |**T**[a]
- **T**[b]|
to be positively correlated with **d**(a,b)
for all a,b in
the plane, and for some appropriate metric **d.**

We have deduced that the construction of the mapping tables must meet the following constraints:

the construction process must be recursive

at each level it must tabulate basins of attraction

at each level it must derive the attractors of the basins of attraction

it must assign index numbers to the attractors such that distances in index number space are strongly correlated with distances in 2-D space.

To use it effectively we need to construct a series
of tables **T**0, 1, ..., m and codebooks **B**0,1,...,m
where **m** is **log**2 of the number
of pixels in the largest patch that we wish to use for compression.

The existing approach is to develop the codebooks
first and then derive the tables from them. Given a codebook **B**m-1
and a codebook of **B**m of the next higher level,
the index entry for **T**m[*i,j***] **is
chosen such that **B**m[Tm[*i,j***]]**
will be the closest approximation to **[B**m-1[*i***],B**m-1[*j***]]**
in the codebook at level **m**.. Since in this approach the construction
of the codebook is disconnected from the construction of the index tables
there is no guarantee that the final codebook will be efficiently used.
One might for instance find that for instance the 43^{rd} entry
in the higher level codebook was never selected as a good approximation
for any pair of lower level codebook entries. It can thus happen that a
significant portion of ones higher level codebook is never used. The method
that we present below is aimed to avoid this inefficient use of the codebook.
It attempts to construct tables and codebooks which maximise the signal
to noise ratio of the reconstructed image and which ensures that all codebook
entries will be used on ‘typical’ image sequences. Typical sequences are
here understood to mean sequences with similar statistical properties to
those of the training set.

The Construction algorithm

Our algorithm will construct a series of mapping
tables **T**^{1},** T**^{2},...**
**such that **T**^{1}_{ij}^{
}^{specifies the attractor index to be output at level 1
when given a pair of horizontally adjacent pixels i,j. At the next
level of the hierarchy, T2}
will map vertically adjacent pairs of **T**^{1}-mapped
horizontally adjacent pairs of pixels. The process is shown schematically
in figure 5, where the pixel square

ijkl

mnop

qrst

uvwx

is shown being passed through 4 levels of mapping to give a single output from the last table.

Figure 5 Hierarchy of table lookup operations used in the compression process, showing how horizontal and vertical groupings alternate at different levels.

The first phase in the construction of a table
at any given level in the hierarchy is to construct a co-occurrence matrix
**F**. This is a square matrix of rank 256 which tabulates how often
pairs of inputs occur. Hence **F*** _{ij}*
will be the frequency with which inputs

The frequency tables **F** follow a hierarchy
exactly corresponding to that of the mapping tables **T**. Whilst we
are constructing tables at level 1 in the hierarchy, these ‘inputs’ are
horizontally adjacent pixels. Whilst we are constructing tables at level
2 in the hierarchy the inputs *i, j* are derived from the outputs
of applying **T**^{1} mappings to the level
below etc. It follows that we can only construct the frequency matrix for
level *n* after the mapping table for level *n-1* has been built.
The frequency tables are built up by scanning a series of images and taking
sample patches from all possible patch origins.

An example of such a co-occurrence matrix is shown in Figure 6. It will be observed that at level 1 there is a strong tendency for occurrences to be clustered along the 45° diagonal. This expresses the fact that adjacent pixels are positively correlated. Within the diagonal cluster, there is a further clustering towards the mean brightness.

Figure 6 A false colour representation of a frequency co-occurrence map of adjacent pixel values

Given the co-occurrence matrix, the algorithm proceeds as follows:

- Determine the mean of the distribution.
- Determine whether the distribution of the data in the x or y direction has the greatest variance.
- Partition the frequency map into two, using a line normal to the axis of greatest variance passing through the mean of the distribution.
- For each of the resulting sub distributions, recursively apply steps 1 to 4 until the depth of recursion reaches 8.

The effect of this partitioning algorithm is illustrated in figure 7.

Each split represents a bit of information about the vector. We wish these bits to be used in a way that will minimise some error metric, in our case Mean Squared Error(MSE). Just as the mean of a Probability Density Function(PDF) is the single estimator of the distribution which minimises expected MSE. If a PDF is split through its mean, the means of the resulting two partitions constitute the pair of estimators which minimise the MSE. The selection of the axis with greatest variance as the one to split, ensures that the split contributes the greatest amount to the minimisation of estimation errors.

Figure 7 An orthogonal-axis recursive partitioning of a frequency map, with resulting codes

Allocating codes

The sequence in which these splits were performed can now be used to assign code numbers to the resultant rectangles. The first split is used to determine the most significant bit of the code number, the second split the next most significant etc. The bit is set to one for the right hand side rectangle of a horizontal split or the upper rectangle of a vertical split. You will observe, that data points that are widely separated in 2-space will tend to have codes that differ in their more significant bits. We thus arrive at codes that both preserve locality, and recursively ensures that the highest code is assigned to the top right and the lowest code to the bottom left rectangle.

Alongside the partitions of the co-occurrence
matrix **F**, a parallel partitioning is performed on the mapping tables
**T** . Each time **F** is subdivided, an additional bit is set or
cleared for all entries in the corresponding partition of **T **. Thus
the first partition to be carried out determines the most significant bit
of all entries. The second level of recursion determines the second most
significant bit etc. This process is illustrated in Figure 7. In each case,
the partition whose elements have higher values in terms of the axis of
greater variance has its bits set to 1 and the other has its bits set to
0.

After eight levels of recursion all of the bits
of all the bytes in the table **T** will have been determined.

It should be noted that the training algorithm
is fast. Segmentation of a frequency table with 2^{16} entries
takes about 10 seconds on a 75Mhz Pentium. For experimental work, this
is a considerable advantage over other codebook training algorithms.

Forming the codebooks

What we have described so far constitutes a categorisation system for groups of 16 pixels, or, if the algorithm is extended further, for groups of 64 etc pixels. Given such categories, we have to derive the codebooks that correspond to it.

Once we have an appropriately partitioned frequency map, this can be used to form a codebook of the relevant scale. For the codebooks with patch size 4x4 the procedure is straightforward. We scan an image and use the HVQ tables to assign code numbers to sample patches within the image. For each code number we average all of the sample patches assigned to it, and the resulting set of intensity values forms the codebook entry.

Figure 8 a 4x4 codebook with 256 entries generated by the algorithm.

It should be emphasised that both when training and when forming the codebook, we use normalisation of the training data. That is, if we are building a set of HVQ tables up to patch size 4 x 4, we take 4 x 4 tiles, normalise their contrast and brightness. During encoding we will take 4 x 4 samples and normalise them prior to using the tables for indexing, thus the tables must be trained under the same conditions. Figure 8 shows a codebook of 4 by 4 patches formed in this way. Patches are allocated code numbers starting with 0 in the top left and 255 in the bottom right in column major order. Observe that patches that are vertically adjacent to one another, and thus differ by 1 in code, are similar in appearance.

Figure 9 A 16x16 pixel codebook corresponding to that in Figure 8.

When using the HVQ tables to compress an image patch whose size is greater than 4x4 we first shrink the patch to size 4x4, averaging source pixels as we go, normalise the resulting patch and then use the tables to determine the codebook index of the required patch. A consequence of this is that the large scale codebook entries should look like enlarged versions of the smaller ones. This is illustrated in Figure 9. We have experimented with two methods for constructing codebooks for patches of sizes 8x8, 16x16 etc.

- The 4x4 codebook is simply scaled up by repeatedly doubling the pixels and applying a smoothing filter. This was how the codebook in figure 9 was generated from that in figure 8.
- For each size an independent run through the source data is made, the source patches are categorised using the same method as in the compressor, and the codebook entries formed as the averages of the source patches that fall into the category.

Figure 10 A portion of a 32x32 codebook formed by source patch averaging.

The first of these approaches is preferred as it leads to smoother codebook patches with less high frequency noise. Using the second approach one gets a speckled effect on the largest patches, unless the training set is very large.

Figure 11. Examples of compression using the HVQ codebooks at different compression settings. Note that on the web there is a degradation of quality of these images due to recompression using JPEG and then rendering using 256 colours.

Examples

Figures 11a to 11c show the effect of compressing an image at 0.5 bits per pixel and at 0.25 bits per pixel. The original image is 192 pixels square and is encoded as a YUV 4:2:0 video frame. The decompressor incorporates a block removal filter. The codebook used incorporated 512 entries and had been trained on a sequence of 16 video frames of size 192 pixels square. The sequence of frames used for training did not include the sample image.

Codebook size

Figures 11d and 11e show the effect of holding the bits per frame constant whilst varying the codebook size. All codebooks were trained on the same training set. In general it has been found that increasing the size of the codebook used improves the quality of picture produced. Over large numbers of frames, a quadrupling of the codebook size produces an improvement of about a third of a decibel in the PSNR.

View My Stats