PROBABILISTIC RETRIEVAL

Introduction

**
**

So far in this book we have made very little use of probability theory in modelling any sub-system in IR. The reason for this is simply that the bulk of the work in IR is non-probabilistic, and it is only recently that some significant headway has been made with probabilistic methods[1,2,3]. The history of the use of probabilistic methods goes back as far as the early sixties but for some reason the early ideas never took hold. In this chapter I shall be describing methods of retrieval, i.e. searching and stopping rules, based on probabilistic considerations. In Chapter 2 I dealt with automatic indexing based on a probabilistic model of the distribution of word tokens within a document (text); here I will be concerned with the distribution of index terms over the set of documents making up a collection or file. I shall be relying heavily on the familiar assumption that the distribution of index terms throughout the collection, or within some subset of it, will tell us something about the likely relevance of any given document.

Perhaps it is as well to warn the reader that some of the material in this chapter is rather mathematical. However, I believe that the framework of retrieval discussed in this chapter is both elegant and potentially extremely powerful*. Although the work on it has been rather recent and thus some may feel that it should stand the test of time, I think it probably represents the most important break-through in IR in the last few years. Therefore I unashamedly make this chapter theoretical, since the theory must be thoroughly understood if any further progress is to be made. There are a number of equivalent ways of presenting the basic theory; I have chosen to present it in such a way that connections with other fields such as pattern recognition are easily made. I shall have more to say about other formulations in the Bibliographic Remarks at the end of the chapter.

The fundamental mathematical tool for this chapter is Bayes' Theorem: most of the equations derive directly from it. Although the underlying mathematics may at first look a little complicated the interpretation is rather simple. So, let me try and immediately given some interpretation of what is to follow.

* This was recognised by Maron in his 'The Logic Behind a Probabilistic Interpretation' as early as 1964[4].

Remember that the basic instrument we have for trying to separate the relevant from the non-relevant documents is a matching function, whether it be that we are in a clustered environment or an unstructured one. The reasons for picking any particular matching function have never been made explicit, in fact mostly they are based on intuitive argument in conjunction with Ockham's Razor. Now in this chapter I shall attempt to use simple probability theory to tell us what a matching function should look like and how it should be used. The arguments are mainly theoretical but in my view fairly conclusive. The only remaining doubt is about the acceptability of the assumptions, which I shall try and bring out as I go along. The data used to fix such a matching function are derived from the knowledge of the distribution of the index terms throughout the collection of some subset of it. If it is defined on some subset of documents then this subset can be defined by a variety of techniques: sampling, clustering, or trial retrieval. The data thus gathered are used to set the values of certain parameters associated with the matching function. Clearly, should the data contain relevance information then the process of defining the matching function can be iterated by some feedback mechanism similar to the one due to Rocchio described in the previous chapter. In this way the parameters of the matching function can be 'learnt'. It is on matching functions derived from relevance information that we shall concentrate.

It will be assumed in the sequel that the documents are described by binary state attributes, that is, absence or presence of index terms. This is not a restriction on the theory, in principle the extension to arbitrary attributes can be worked out, although it is not clear that this would be worth doing[5].

**
Estimation or calculation of relevance**

**
**

When we search a document collection, we attempt to retrieve relevant documents without retrieving non-relevant ones. Since we have no oracle which will tell us without fail which documents are relevant and which are non-relevant we must use imperfect knowledge to guess for any given document whether it is relevant or non-relevant. Without going into the philosophical paradoxes associated with relevance, I shall assume that we can only guess at relevance through summary data about the document and its relationships with other documents. This is not an unreasonable assumption particularly if one believes that the only way relevance can ultimately be decided is for the user to read the full text. Therefore, a sensible way of computing our guess is to try and estimate for any document its probability of relevance

*
PQ* (relevance/document)

where the Q is meant to emphasise that it is for a specific query. It is not clear at all what kind of probability this is (see Good[6] for a delightful summary of different kinds), but if we are to make sense of it with a computer and the primitive data we have, it must surely be one based on frequency counts. Thus our probability of relevance is a statistical notion rather than a semantic one, but I believe that the degree of relevance computed on the basis of statistical analysis will tend to be very similar to one arrived at one semantic grounds. Just as a matching function attaches a numerical score to each document and will vary from document to document so will the probability, for some it will be greater than for others and of course it will depend on the query. The variation between queries will be ignored for now, it only becomes important at the evaluation stage. So we will assume only one query has been submitted to the system and we are concerned with

*
P* (relevance/document).

Let us now assume (following Robertson[7]) that:

(1) The *relevance* of a document to a request is independent of other documents

in the collection.

With this assumption we can now state a principle, in terms of probability of relevance, which shows that probabilistic information can be used in an optimal manner in retrieval. Robertson attributes this principle to W. S Cooper although Maron in 1964 already claimed its optimality[4].

*
The probability ranking principle*.
If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.

Of course this principle raises many questions as to the acceptability of the assumptions.
For example, the Cluster Hypothesis, that closely associated documents tend to be relevant to the same requests, explicitly assumes the contrary of assumption (1).
Goffman[8] too, in his work has gone to some pains to make an explicit assumption of dependence.
I quote: 'Thus, if a document *x* has been assessed as relevant to a query s, the relevance of the other documents in the file *X* may be affected since the value of the information conveyed by these documents may either increase or decrease as a result of the information conveyed by the document *x*.' Then there is the question of the way in which overall effectiveness is to be measured.
Robertson in his paper shows the probability ranking principle to hold if we measure effectiveness in terms of Recall and Fallout.
The principle also follows simply from the theory in this chapter.
But this is not the place to argue out these research questions, however, I do think it reasonable to adopt the principle as one upon which to construct a probabilistic retrieval model.
One word of warning, the probability ranking principle can only be shown to be true for *one* query.
It does not say that the performance over a range of queries will be optimised, to establish a result of this kind one would have to be specific about how one would average the performance across queries.

The probability ranking principle assumes that we can calculate *P*(relevance/document), not only that, it assumes that we can do it accurately.
Now this is an extremely troublesome assumption and it will occupy us some more further on.
The problem is simply that we do not know which are the relevant documents, nor do we know how many there are so we have no way of calculating *P*(relevance/document).
But we can, by trial retrieval, guess at *P*(relevance/ document) and hopefully improve our guess by iteration.
To simplify matters in the subsequent discussion I shall assume that the statistics relating to the relevant and non-relevant documents are available and I shall use them to build up the pertinent equations.
However, at all times the reader should be aware of the fact that in any practical situation the relevance information must be guessed at (or estimated).

So returning now to the immediate problem which is to calculate, or estimate, *P*(relevance/ document).
For this we use Bayes' Theorem, which relates the posterior probability of relevance to the prior probability of relevance and the likelihood of relevance after observing a document.
Before we plunge into a formal expression of this I must introduce some symbols which will make things a little easier as we go along.

**
Basic probabilistic model***

**
**

Since we are assuming that each document is described by the presence/absence of index terms any document can be represented by a binary vector,

**
x **= (

where *xi* = 0 or 1 indicates absence or presence of the ith index term.
We also assume that there are two mutually exclusive events,

*
w*1 = document is relevant

*
w*2 = document is non-relevant.

* The theory that follows is at first rather abstract, the reader is asked to bear with it, since we soon return to the

nuts and bolts of retrieval.

So, in terms of these symbols, what we wish to calculate for each document is *P*(*w*1/** x**) and perhaps

Here *P*(*wi*) is the *prior* probability of relevance (*i*=1) or non-relevance (*i*=2), *P*(** x**/

which is the probability of observing ** x** on a random basis given that it may be either relevant or non-relevant.
Again this would be written as a density function

The decision rule we use is in fact well known as Bayes' Decision Rule. It is

[*P* (*w*1/** x**) >

The expression D1 is a short hand notation for the following: compare *P* (*w*1/** x**) with

* The meaning of [E -> p,q] is that if *E* is true then decide *p*, otherwise decide *q.*

In other words once we have decided one way (e.g. relevant) then the probability of having made an error is clearly given by the probability of the opposite way being the case (e.g. non-relevant).
So to make this error as small as possible for any given ** x** we must always pick that

This sum will be minimised by making *P* (error/** x**) as small as possible for each

Of course average error is not the only sensible quantity worth minimising.
If we associate with each type of error a *cost* we can derive a decision rule which will minimise the overall *risk*.
The overall risk is an average of the conditional risks *R*(*wi*/** x**) which itself in turn is defined in terms of a cost function

*
R* (*wi*/** x**) -

The overall risk is a sum in the same way that the average probability of error was, *R* (*wi*/** x**) now playing the role of

[*R* (*w*1/** x**) <

D1 and D2 can be shown to be equivalent under certain conditions. First we rewrite D1, using Bayes' Theorem, in a form in which it will be used subsequently, viz.

[*P*( **x**/*w*1) *P* (*w*1) > *P*( **x**/*w*2) *P*(*w*2) -> ** x** is relevant,

Notice that *P*(** x**) has disappeared from the equation since it does not affect the outcome of the decision.
Now, using the definition

[*R* (*w*1/** x**) <

When a special loss function is chosen, namely,

which implies that no loss is assigned to a correct decision (quite reasonable) and unit loss to any error (not so reasonable), then we have

[*R* (*w*1/** x**) <

which shows the equivalence of D2 and D3, and hence of D1 and D2 under a binary loss function.

This completes the derivation of the decision rule to be used to decide relevance or non-relevance, or to put it differently to retrieve or not to retrieve.
So far no constraints have been put on the form of *P*(** x**/

**
Form of retrieval function**

**
**

The previous section was rather abstract and left the connection of the various probabilities with IR rather open.
Although it is reasonable for us to want to calculate *P*(relevance/document) it is not at all clear as to how this should be done or whether the inversion through Bayes' Theorem is the best way of getting at it.
Nevertheless, we will proceed assuming that *P*(** x**/

* *

*
P*(** x**/

Later I shall show how this stringent assumption may be relaxed. We also for the moment ignore the fact that assuming independence conditional on both w1 and w2 separately has implications about the dependence conditional on w1 [[logicalor]] w2.

Let us now take the simplified form of *P*(** x**/

*
pi* = Prob (*xi* = 1/*w*1)

*
qi * = Prob (*xi* = 1/*w*2).

In words *pi*(*qi*) is the probability that if the document is relevant (non-relevant) that the *i*th index term will be present.
The corresponding probabilities for absence are calculated by subtracting from 1, i.e. 1 - *pi* = Prob (*xi* = 0/*w*1).
The likelihood functions which enter into D3 will now look as follows

To appreciate how these expressions work, the reader should check that *P*((0,1,1,0,0,1)/*w*1) = (1 - *p*1)*p*2*p*3(1 - *p*4)(1 - *p*5)*p*6.
Substituting for *P*(** x**/

where the constants *ai, bi* and *e* are obvious.

and

The importance of writing it this way, apart from its simplicity, is that for each document ** x** to calculate

The constant *C* which has been assumed the same for all documents ** x** will of course vary from query to query, but it can be interpreted as the cut-off applied to the retrieval function.
The only part that can be varied with respect to a given query is the cost function, and it is this variation which will allow us to retrieve more or less documents.
To see this let us assume that

Let us now turn to the other part of *g*(** x**), namely

There will be one such table for each index term; I have shown it for the index term *i* although the subscript *i* has not been used in the cells.
If we have *complete* information about the relevant and non-relevant documents in the collection then we can estimate *pi* by *r/R* and *qi* by (*n - r*)/(*N - R*).
Therefore *g*(** x**) can be rewritten as follows:

This is in fact the weighting formula F4 used by Robertson and Sparck Jones1 in their so called retrospective experiments. For later convenience let us set

There are a number of ways of looking at *Ki*.
The most interesting interpretation of *Ki* is to say that it measures the extent to which the *i*th term can discriminate between the relevant and non-relevant documents.

Typically the 'weight' *Ki*(*N,r,n,R*) is estimated from a contingency table in which *N* is not the total number of documents in the system but instead is some subset specifically chosen to enable *Ki* to be estimated.
Later I will use the above interpretation of *Ki* to motivate another function similar to *Ki* to measure the discrimination power of an index term.

**
The index terms are not independent**

**
**

Although it may be mathematically convenient to assume that the index terms are independent it by no means follows that it is realistic to do so.
The objection to independence is not new, in 1964 H. H. Williams[9] expressed it this way: 'The assumption of independence of words in a document is usually made as a matter of mathematical convenience.
Without the assumption, many of the subsequent mathematical relations could not be expressed.
With it, many of *the conclusions should be accepted with extreme caution.*' It is only because the mathematics become rather intractable if dependence is assumed that people are quick to assume independence.
But, 'dependence is the norm rather than the contrary' to quote the famous probability theorist De Finetti[10].
Therefore the correct procedure is to assume dependence and allow the analysis to simplify to the independent case should the latter be true.
When speaking of dependence here we mean *stochastic* dependence; it is not intended as logical dependence although this may imply stochastic dependence.
For IR data, stochastic dependence is simply measured by a correlation function or in some other equivalent way.
The assumption of dependence could be crucial when we are trying to estimate *P*(relevance/document) in terms of *P*(** x**/

In general the dependence can be arbitrarily complex as the following identity illustrates,

*
P*(** x**) =

Therefore, to capture all dependence data we would need to condition each variable in turn on a steadily increasing set of other variables. Although in principle this may be possible, it is likely to be computationally inefficient, and impossible in some instances where there is insufficient data to calculate the high order dependencies. Instead we adopt a method of approximation to estimate P(x) which captures the significant dependence information. Intuitively this may be described as one which looks at each factor in the above expansion and selects from the conditioning variables one particular variable which accounts for most of the dependence relation. In other words we seek a product approximation of the form

where (*m*1, *m*2, ..., *mn*) is a permutation of the integers 1,2, ..., *n* and *j*(.) is a function mapping *i* into integers less than *i*, and *P*(*xi*/xm0) is *P*(*xi*).
An example for a six component vector ** x** = (

*
Pt*(** x**) =

Notice how similar the A2 assumption is to the independence assumption A1, the only difference being that in A2 each factor has a conditioning variable associated with it.
In the example the permutation (*m*1, *m*2, ..., *m*6) is (1,2, ..., 6) which is just the natural order, of course the reason for writing the expansion for *Pt*(** x**) the way I did in A2 is to show that a permutation of (1,2, ..., 6) must be sought that gives a

The permutation and the function j(.) together define a *dependence tree* and the corresponding *Pt*(** x**) is called a probability distribution of (first-order) tree dependence.
The tree corresponding to our six variable example is shown in

write the function *Pt*(** x**) the way I did with

Applying this to the link between the root node *x*1 and its immediate descendant *x*2 in the example will shift the root to *x*2 and change the expansion to

*
Pt*(*x*1, *x*2, ...
*x*6) = *P*(*x*2)*P*(*x*1)/*x*2)*P*(*x*3/*x*2)*P*(*x*4/*x*2)*P*(*x*5/*x*2)*P* (*x*6/*x*5)

Of course, to satisfy the rule about relabelling we would exchange the names '1' and '2'.
All expansions transformed in this way are equivalent in terms of goodness of approximation to *P*(** x**).
It is therefore the

In what follows I shall assume that the relabelling has been done and that *xmi* = *xi*.

**
Selecting the best dependence trees**

**
**

Our problem now is to find a probability function of the form *Pt*(** x**) on a set of documents which is the best approximation to the true joint probability function

The goodness of the approximation is measured by a well known function (see, for example, Kullback[12]); if *P*(** x**) and

* That this is indeed the case is shown by Ku and Kullback[11].

is a measure of the extent to which *P*a(** x**) approximates

If the extent to which two index terms *i* and *j *deviate from independence is measured by the *expected mutual information measure* (EMIM) (see Chapter 3, p 41).

then the best approximation *Pt*(** x**), in the sense of minimising

is a maximum. This is a highly condensed statement of how the dependence tree is arrived at, unfortunately a fuller statement would be rather technical. A detailed proof of the optimisation procedure can be found in Chow and Liu[13]. Here we are mainly interested in the application of the tree structure.

One way of looking at the MST is that it incorporates the most significant of the dependences between the variables subject to the global constraint that the sum of them should be a maximum.
For example, in *Figure 6.1* the links between the variables (nodes, x1, ..., x6) have been put in just because the sum

*
I*(*x*1,*x*2) +*I*(*x*2,*x*3) + *I*(*x*2,*x*4) + *I*(*x*2,*x*5) + *I*(*x*5/*x*6)

is a maximum.
Any other sum will be less than or equal to this sum.
Note that it does *not *mean that any individual weight associated with an edge in the tree will be greater than one not in the tree, although this will mostly be the case.

Once the dependence tree has been found the approximating distribution can be written down immediately in the form A2. From this I can derive a discriminant function just as I did in the independent case.

*
ti *= Prob (*xi* = 1/*xj(i)* = 1)

*
ri* = Prob (*xi *= 1/*x*j(i) = 0) and *r*1 = Prob (*x*1 = 1)

*
P*(*xi* /*xj(i)*) = [*ti[xi]*(1* - ti*)[1] [- xi]] [xj(i) []*ri[xi ]*(1 - *ri*)[1] [- xi]] [1] [- xj(i)]

then

This is a *non-linear* weighting function which will simplify to the one derived from A1 when the variables are assumed to be independent, that is, when *ti* = *ri*.
The constant has the same interpretation in terms of prior probabilities and loss function.
The complete decision function is of course

*
g*(** x**) = log

which now involves the calculation (or estimation) of twice as many parameters as in the linear case.
It is only the sum involving *xj(i)* which make this weighting function different from the linear one, and it is this part which enables a retrieval strategy to take into account the fact that *xi* depends on *xj(i)*.
When using the weighting function a document containing *xj(i)*, or both *xi* and *xj(i)*, will receive a contribution from that part of the weighting function.

It is easier to see how *g*(** x**) combines different weights for different terms if one looks at the weights contributed to

and similarly for the other three settings of *xi* and *xj(i)*.

This shows how simple the non-linear weighting function really is.
For example, given a document in which *i* occurs but *j*(*i*) does not, then the weight contributed to *g*(** x**) is based on the ratio of two probabilities.
The first is the probability of occurrence of

**
Estimation of parameters**

**
**

The use of a weighting function of the kind derived above in actual retrieval requires the estimation of pertinent parameters.
I shall here deal with the estimation of *ti* and *ri* for the non-linear case, obviously the linear case will follow by analogy.
To show what is involved let me given an example of the estimation process using simple maximum likelihood estimates.
The basis for our estimates is the following 2-by-2 table.

Here I have adopted a labelling scheme for the cells in which [x] means the number of occurrences in the cell labelled x. Ignoring for the moment the nature of the set on which this table is based; our estimates might be as follows:

In general we would have two tables of this kind when setting up our function *g*(** x**), one for estimating the parameters associated with

The estimates shown above are examples of *point estimates*.
There are a number of ways of arriving at an appropriate rule for point estimation.
Unfortunately the best form of estimation rule is still an open problem[14].
In fact, some statisticians believe that point estimation should not be attempted at all[15].
However in the context of IR it is hard to see how one can avoid making point estimates.
One major objection to any point estimation rule is that in deriving it some 'arbitrary' assumptions are made.
Fortunately in IR there is some chance of justifying these assumptions by pointing to experimental data gathered from retrieval systems, thereby removing some of the arbitrariness.

Two basic assumptions made in deriving any estimation rule through Bayesian decision theory are:

(1) the form of the prior distribution on the parameter space, i.e. in our case the assumed

probability distribution on the possible values of the binomial parameter; and

(2) the form of the loss function used to measure the error made in estimating the

parameter.

Once these two assumptions are made explicit by defining the form of the distribution and loss function, then, together with *Bayes' Principle* which seeks to minimise the posterior conditional expected loss given the observations, we can derive a number of different estimation rules.
The statistical literature is not much help when deciding which rule is to be preferred.
For details the reader should consult van Rijsbergen[2] where further references to the statistical literature are given.
The important rules of estimating a proportion p all come in the form

where *x* is the number of successes in *n* trials, and *a* and *b* are parameters dictated by the particular combination of prior and loss function.
Thus we have a whole class of estimation rules.
For example when *a*=*b*=0 we have the usual estimate *x/n*, and when *a*=*b*=[1]/2 we have a rule attributed to Sir Harold Jeffreys by Good[16].
This latter rule is in fact the rule used by Robertson and Sparck Jones[1] in their estimates.
Each setting of *a* and *b* can be justified in terms of the reasonableness of the resulting prior distribution.
Since what is found reasonable by one man is not necessarily so for another, the ultimate choice must rest on performance in an experimental test.
Fortunately in IR we are in a unique position to do this kind of test.

One important reason for having estimation rules different from the simple *x/n*, is that this is rather unrealistic for small samples.
Consider the case of one sample (*n* = 1) and the trial result *x* = 0 (or *x* = 1) which would result in the estimate for *p* as *p* = 0 (or *p* = 1).
This is clearly ridiculous, since in most cases we would already know with high probability that

0 < *p* < 1.
To overcome this difficulty we might try and incorporate this prior knowledge in a distribution on the possible values of the parameter we are trying to estimate.
Once we have accepted the feasibility of this and have specified the way in which estimation error is to be measured, Bayes' Principle (or some other principle) will usually lead to a rule different from *x/n*.

This is really as much as I wish to say about estimation rules, and therefore I shall not push the technical discussion on this points any further; the interested reader should consult the readily accessible statistical literature.

**
Recapitulation**

**
**

At this point I should like to summarise the formal argument thus far so that we may reduce it to simple English. One reason for doing this now is that so far I have stuck closely to what one might call a 'respectable' theoretical development. But as in most applied subjects, in IR when it comes to implementing or using a theory one is forced by either inefficiency or inadequate data to diverge from the strict theoretical model. Naturally one tries to diverge as little as possible, but it is of the essence of research that heuristic modifications to a theory are made so as to fit the real data more closely. One obvious consequence is that it may lead to a better new theory.

The first point to make then, is that, we have been trying to estimate *P*(relevance/document), that is, the probability of relevance for a given document.
Although I can easily write the preceding sentence it is not at all clear that it will be meaningful.
Relevance in itself is a difficult notion, that the *probability* of relevance means something can be objected to on the same grounds that one might object to the probability of Newton's Second Law of Motion being the case.
Some would argue that the probability is either one or zero depending on whether it is true or false.
Similarly one could argue for relevance.
The second point is that the probability *P*(relevance/document) can be got at by considering the inverse probability *P*(** x**/relevance), thus relating the two through Bayes' Theorem.
It is not that I am questioning the use of Bayes' Theorem when applied to probabilities, which is forced upon us anyhow if we want to use probability theory consistently, no, what I am questioning is that

To approach the problem in this way would be useless unless one believed that for many index terms the distribution over the relevant documents is different from that over the non-relevant documents.
If we assumed the contrary, that is *P*(** x**/relevance) =

The elaboration in terms of ranking rather than just discrimination is trivial: the cut-off set by the constant in *g*(** x**) is gradually relaxed thereby increasing the number of documents retrieved (or assigned to the relevant category).
The result that the ranking is optimal follows from the fact that at each cut-off value we minimise the overall risk.
This optimality should be treated with some caution since it assumes that we have got the form of the

If one is prepared to let the user set the cut-off *after* retrieval has taken place then the need for a theory about cut-off disappears.
The implication is that instead of working with the ratio

we work with the ratio

In the latter case we do not see the retrieval problem as one of discriminating between relevant and non-relevant documents, instead we merely wish to compute the *P*(relevance/** x**) for each document

The decision rules derived above are couched in terms of *P*(** x**/

I will now proceed to discuss ways of using this probabilistic model of retrieval and at the same time discuss some of the practical problems that arise. At first I will hardly modify the model at all. But then I will discuss a way of using it which does not necessarily accord strictly with the assumptions upon which it was built in the first place. Naturally the justification for any of this will lie in the province of experimental tests of which many still remain to be done[17]. But first I shall explain a minor modification arising from the need to reduce the dimensionality of our problem.

**
The curse of dimensionality**

In deriving the decision rules I assumed that a document is represented by an *n*-dimensional vector where *n* is the size of the index term vocabulary.
Typically *n* would be very large, and so the dimension of the (binary) document vectors is always likely to be greater than the number of samples used to estimate the parameters in the decision function.
That this will lead to problems has been pointed out repeatedly in the pattern recognition literature.
Although the analysis of the problem in pattern recognition applies to IR as well, the solutions are not directly applicable.
In pattern recognition the problem is: given the number of samples that have been used to 'train' the decision function (our weighting function), is there an optimum number of measurements that can be made of an unknown pattern so that the average probability of correct assignment can be maximised? In our case how many index terms can we legitimately use to decide on relevance.
Hughes[18] shows that for a very general probabilistic structure the number of measurements is surprisingly small even though reasonably sized samples are used to 'train' the decision function.

Ideally one would like to be able to choose a (small) subset of index terms to which the weighting function *g*(.) would be restricted thereby maximising the average probability of correct assignment.
In pattern recognition there are complicated techniques for doing just that for the equivalent problem.
In information retrieval we are fortunate in that there is a natural way in which the dimensionality of the problem can be reduced.
We accept that the query terms are a fair guide to the best features to be used in the application of *g*(.) to decide between relevance and non-relevance.
Therefore rather than computing the weighting function for all possible terms we restrict *g*(.) to the terms specified in the query and possibly their close associates.
This would be as if during the retrieval process all documents are projected from a high dimensional space into a subspace spanned by a small number of terms.

**
Computational details**

**
**

I now turn to some of the more practical details of computing *g*(** x**) for each

**
1. Calculation of EMIM**

*
*

The calculation of the expected mutual information measure can be simplified. Then EMIM itself can be approximated to reduce the computation time even further. We take the simplification first.

When computing *I*(*xi*,*xj*) for the purpose of constructing an MST we need only to know the rank ordering of the *I*(*xi*,*xj*)'s.
The absolute values do not matter.
Therefore if we use simple maximum likelihood estimates for the probabilities based on the data contained in the following table (using the same notation as on p.125).

then *I*(*xi*,*xj*) will be strictly monotone with

This is an extremely simple formulation of EMIM and easy to compute. Consider the case when it is P(x) we are trying to calculate. The MST is then based on co-occurrence data derived from the entire collection. Once we have this (i.e. [1]) and know the number of documents ([9]) in the file then any inverted file will contain the rest of the frequency data needed to fill in the counts in the other cells. That is from [5] and [7] given by the inverted file we can deduce [2] [3] [4] [6] and [8].

The problem of what to do with zero entries in one of the cells 1 to 4 is taken care of by letting 0 log 0 = 0. The marginals cannot be zero since we are only concerned with terms that occur at least once in the documents.

Next we discuss the possibility of approximation. Maron and Kuhns[19] in their early work used

*
d*(*xi*,*xj*) = *P*(*xi* = 1, *xj* = 1) - *P*(*xi *=1) *P*(*xj* = 1) (*)

to measure the deviation from independence for any two index terms *i* and *j*.
Apart from the log this is essentially the first term of the EMIM expansion.
An MST (dependence tree) constructed on the basis of (*) clearly would not lead to an optimal approximation of *P*(** x**/

as a measure of association. No doubt there are other ways of approximating the EMIM which are easier to compute, but whether they can be used to find a dependence tree leading to good approximation of the joint probability function must remain a matter for experimental test.

*
2.
Calculation of MST*

*
*

There are numerous published algorithms for generating an MST from pairwise association measures, the most efficient probably being the recent one due to Whitney[21].
The time dependence of his algorithm is 0(*n[2]*) where *n* is the number of index terms to be fitted into the tree.
This is not a barrier to its use on large data sets, for it is easy to partition the data by some coarse clustering technique as recommended on p.59, after which the *total* spanning tree can be generated by applying the MST algorithm to each cluster of index terms in turn.
This will reduce the time dependence from 0(*n[2]*) to 0(*k[2]*) where *k* << *n*.

It is along these lines that Bentley and Friedman[22] have shown that by exploiting the geometry of the space in which the index terms are points the computation time for generating the MST can be shown to be almost always 0(*n* log *n*).
Moreover if one is prepared to accept a spanning tree which is almost an MST then a computation time of 0(*n* log *n*) is guaranteed.

One major inefficiency in generating the MST is of course due to the fact that all *n*(*n*-1)/2 associations are computed whereas only a small number are in fact significant in the sense that they are non-zero and could therefore be chosen for a weight of an edge in the spanning tree.
However, a high proportion are zero and could safely be omitted.
Unfortunately, the only way we can ignore them is to first compute them.
Croft[23] in a recent design for the single-link algorithm has discovered a way of ignoring associations without first computing them.
It does however presuppose that a file and its inverted form are available, so that if this is not so some computation time would need to be invested in the inversion.
It may be that a similar algorithm could be devised for computing an MST.

*
3.
Calculation of g(x)*

*
*

It must be emphasised that in the non-linear case the estimation of the parameters for *g*(** x**) will ideally involve a different MST for each of

There is a choice of how one would implement the model for *g*(** x**) depending on whether one is interested in setting the cut-off

If one assumes that the cut-off is set *a posteriori* then we can rank the documents according to *P*(*w*1/** x**) and leave the user to decide when he has seen enough.
In other words we use the form

to calculate (estimate) the probability of relevance for each document x.
Now here we only need to estimate for *P*(** x**/

**
An alternative way of using the dependence tree (Association Hypothesis)**

**
**

Some of the arguments advanced in the previous section can be construed as implying that the only dependence tree we have enough information to construct is the one on the entire document collection. Let us pursue this line of argument a little further. To construct a dependence tree for index terms without using relevance information is similar to constructing an index term classification. In Chapter 3 I pointed out the relationship between the MST and single-link, which shows that the one is not very different from the other. This leads directly to the idea that perhaps the dependence tree could be used in the same way as one would a term clustering.

The basic idea underlying term clustering was explained in Chapter 2.
This could be summarised by saying that based on term clustering various strategies for term *deletion* and *addition* can be implemented.
Forgetting about 'deletion' for the moment, it is clear how the dependence tree might be used to add in terms to, or expand, the query.
The reason for doing this was neatly put by Maron in 1964: 'How can one increase the probability of retrieving a class of documents that includes relevant material not otherwise selected? One obvious method suggests itself: namely, to enlarge the initial request by using additional index terms which have a similar or related meaning to those of the given request'[4].
The assumption here is that 'related meaning' can be discovered through statistical association.
Therefore I suggest that given a query, which is an incomplete specification of the information need and hence the relevant documents, we use the document collection (through the dependence tree) to tell us what other terms not already in the query may be useful in retrieving relevant documents.
Thus I am claiming that index terms directly related (i.e. connected) to a query term in the dependence tree are likely to be useful in retrieval.
In a sense I have reformulated the hypothesis on which term clustering is based (see p.31).
Let me state it formally now, and call it the *Association Hypothesis*:

If an index term is good at discriminating relevant from non-relevant documents then any closely associated index term is also likely to be good at this.

The way we interpret this hypothesis is that a term in the query used by a user is likely to be there because it is a good discriminator and hence we are interested in its close associates.
The hypothesis does not specify the way in which association between index terms is to be measured although in this chapter I have made a case for using EMIM.
Neither does it specify a measure of 'discrimination', this I consider in the next section.
The Association Hypothesis in some ways is a *dual* to the Cluster Hypothesis (p. 45) and can be tested in the same way.

**
Discrimination power of an index term**

**
**

On p. 120 I defined

and in fact there made the comment that it was a measure of the power of term i to discriminate between relevant and non-relevant documents.
The weights in the weighting function derived from the independence assumption A1 are exactly these *Ki*'s.
Now if we forget for the moment that these weights are a consequence of a particular model and instead consider the notion of discrimination power of an index term on its own merits.
Certainly this is not a novel thing to do, Salton in some of his work has sought effective ways of measuring the 'discrimination value' of index terms[24].
It seems reasonable to attach to any index term that enters into the retrieval process a weight related to its discrimination power.
*Ki* as a measure of this power is slightly awkward in that it becomes undefined when the argument of the log function becomes zero.
We therefore seek a more 'robust' function for measuring discrimination power.
The function I am about to recommend for this purpose is indeed more robust, has an interesting interpretation, and enables me to derive a general result of considerable interest in the next section.
However, it must be emphasised that it is only an example of a function which enables some sense to be make of the notion 'discrimination power' in this and the next section.
It should therefore not be considered unique although it is my opinion that any alternative way of measuring discrimination power in this context would come very close to the measure I suggest here.

Instead of *Ki* I suggest using the *information radius*, defined in Chapter 3 on p. 42, as a measure of the discrimination power of an index term.
It is a close cousin of the expected mutual information measure a relationship that will come in useful later on.
Using *u* and *v* as positive weights such as *u* + *v* = 1 and the usual notation for the probability functions we can write the information radius as follows:

The interesting interpretation of the information radius that I referred to above is illustrated most easily in terms of continuous probability functions.
Instead of using the densities *p*(./*w*1) and *p*(./*w*2) I shall use the corresponding probability measure u1 and u2.
First we define the average of two directed divergencies[25],

*
R* (u1, u2/*v*) = *uI* (u1/*v*) +*vI* (u2/*v*)

where *I*(u*i*/*v*) measures the expectation on u*i* of the information in favour of rejecting *v* for u*i* given by making an observation; it may be regarded as the information gained from being told to reject *v* in favour of u*i*.
Now the information radius is the minimum

thereby removing the arbitrary *v*.
In fact it turns out that the minimum is achieved when

*
v* = *u *u1 + *v *u2

that is, an average of the two distributions to be discriminated.
If we now adopt *u* and *v* as the prior probabilities then v is in fact given by the density

*
p*(** x**) =

defined over the entire collection without regard to relevance.
Now of this distribution we are reasonably sure, the distribution u1 and u2 we are only guessing at; therefore it is reasonable when measuring the difference between u1 and u2 that *v* should incorporate as much of the information that is available.
The information radius does just this.

There is one technical problem associated with the use of the information radius, or any other 'discrimination measure' based on all four cells of the contingency table, which is rather difficult to resolve. As a measure of discrimination power it does not distinguish between the different contributions made to the measure by the different cells. So, for example, an index term might be a good discriminator because it occurs frequently in the non-relevant documents and infrequently in the relevant documents. Therefore, to weight an index term proportional to the discrimination measure whenever it is present in a document is exactly the wrong thing to do. It follows that the data contained in the contingency table must be used when deciding on a weighting scheme.

**
Discrimination gain hypothesis**

**
**

In the derivation above I have made the assumption of independence or dependence in a straightforward way.
I have assumed either independence on both w1 and w2, or dependence.
But, as implied earlier, this is not the only way of making these assumptions.
Robertson and Sparck Jones[1] make the point that assuming independence on the relevant *and* non-relevant documents can imply dependence on the total set of documents.
To see this consider two index terms *i *and *j*, and

* *

*
P*(*xi, xj*) = *P*(*xi, xj* /*w*1)*P*(*w*1) + *P*(*xi*, *xi */*w*2) *P* (*w*2)

*
P*(*xi)* *P*(* xj*) = [*P*(*xi */*w*1)*P*(*w*1) + *P*(*xi*, *w*2) *P* (*w*2)] [*P*(*xj /w*1) *P*(*w*1) + *P*(*xj*,*w*2) *P* (*w*2)]

If we assume *conditional *independence on both w1 and w2 then

* *

*
P*(*xi, xj*) = *P*(*xi, /w*1) *P*(*xj, w*1) *P*(*w*1) + *P*(*xi* /*w*2) *P*(*xj/* *w*2) *P* (*w*2)

For unconditional independence as well, we must have

*
P*(*xi, xj*) = *P*(*xi) P*(*xj*)

This will only happen when *P*(*w*1) = 0 or *P*(*w*2) = 0, or *P*(*xi/ w*1) = *P*(*xi/w*2), or *P*(*xj/w*1) = *P*(*xj /w*2), or in words, when at least one of the index terms is useless at discriminating relevant from non-relevant documents.
In general therefore conditional *in*dependence will imply unconditional *de*pendence.
Now let us assume that the index terms are indeed conditionally independence then we get the following remarkable results.

Kendall and Stuart[26] define a partial correlation coefficient for any two distributions by

where *[[rho]] *(.,./*W*) and *[[rho]] *(.,.) are the conditional and ordinary correlation coefficients respectively.
Now if *X* and *Y* are conditionally independent then

*
[[rho]] *(*X*, *Y*/*W*) = 0

which implies using the expression for the partial correlation that

*
[[rho]] *(*X*, *Y*) = *[[rho]] *(*X*, *W*) *[[rho]] *(*Y*, *W*)

Since

| *[[rho]] *(*X*, *Y*) | <= 1 , | *[[rho]] *(*X*, *W*) | <= 1 , | *[[rho]] *(*Y*, *W*) | <= 1

this in turn implies that under the hypothesis of conditional independence

| *[[rho]] *(*X*, *Y*) | < | *[[rho]] *(*X*, *W*) | or | *[[rho]] *(*Y*, *W*) | (**)

Hence if *W* is a random variable representing relevance then the correlation between it and either index term is greater than the correlation between the index terms.

Qualitatively I shall try and generalise this to functions other than correlation coefficients, Linfott[27] defines a type of informational correlation measure by

*
rij *= (1 - exp (-2*I* (*xi*, *xj*) ) )[1/2 ]0 <= *rij *<= 1

or

where *I *(*xi, xj*) is the now familiar expected mutual information measure.
But rij reduces to the standard correlation coefficient *[[rho]]* (.,.) if (*xi, xj*) is normally distributed.
So it is not unreasonable to assume that for non-normal distributions rij will behave approximately like *[[rho]]* (.,.) and will in fact satisfy (**) as well.
But rij is strictly monotone with respect to *I *(*x,i, xj*) so it too will satisfy (**).
Therefore we can now say that under conditional independence the information contained in one index term about another is less than the information contained in either term about the conditioning variable W.
In symbols we have

*
I *(*xi, xj*) < *I *(*xi, W*) or *I *(*xj*, *W*),

where *I* (., *W*) is the information radius with its weights interpreted as prior probabilities.
Remember that *I* (.,*W*) was suggested as the measure of discrimination power.
I think this result deserves to be stated formally as an hypothesis when *W* is interpreted as relevance.

*
Discrimination Gain Hypothesis:* Under the hypothesis of conditional independence the statistical information contained in one index term about another is less than the information contained in either index term about relevance.

I must emphasise that the above argument leading to the hypothesis is not a proof.
The argument is only a qualitative one although I believe it could be tightened up.
Despite this it provides (together with the hypothesis) some justification and theoretical basis for the use of the MST based on *I *(*xi, xj*) to improve retrieval.
The discrimination hypothesis is a way of firming up the Association Hypothesis under conditional independence.

One consequence of the discrimination hypothesis is that it provides a rationale for ranking the index terms connected to a query term in the dependence tree in order of I(term, query term) values to reflect the order of discrimination power values.
The basis for this is that the more strongly connected an index term is to the query term (measured by EMIM) the more discriminatory it is likely to be.
To see what is involved more clearly I have shown an example set-up in Figure 6.2.
Let us suppose that x1 is the variable corresponding to the query term and that *I *(*x*1*, x*2) < *I *(*x*1*, x*3) < *I *(*x*1*, x*4) < *I *(*x*1*, x*5) then our hypothesis says that without knowing in advance how good a discriminator each of the index terms 2,3,4,5 is, it is reasonable to assume that *I *(*x*2*, W*) < *I *(*x*3*, W*) < *I *(*x*4*, W*) <*I *(*x*5*, W*).
Clearly we cannot guarantee that the index terms will satisfy the last ordering but it is the best we can do given our ignorance.

**
**

**
Bibliographic remarks**

**
**

The basis background reading for this chapter is contained in but a few papers. One approach to probabilistic weighting based on relevance data derives from the work of Yu and his collaborators[28,29]. The other is contained in the already frequently cited paper of Robertson and Sparck Jones[1]. Unfortunately, both these approaches rely heavily on the assumption of stochastic independence. My own paper[2] and the one of Bookstein and Kraft[3] are the only ones I know of, which try and construct a model without this assumption. Perhaps an earlier paper by Negoita should be mentioned here which discusses an attempt to use non-linear decision functions in IR[30]. Robertson's recent progress in documentation on models gives a useful summary of some of the more recent work[31].

According to Doyle[32] (p.267), Maron and Kuhns[19] were the first to describe in the open literature the use of association (statistical co-occurrence) of index terms as a means of enlarging and sharpening the search.
However, Doyle himself was already working on similar ideas in the late fifties[33] and produced a number of papers on 'associations' in the early sixties[34,35].
Stiles in 1961[36], already apparently aware of Maron and Kuhns work, gave an explicit procedure for using terms co-occurring significantly with search terms, and not unlike the method based on the dependence tree described in this chapter.
He also used the [[chi]][2] to measure association between index terms which is mathematically very similar to using the expected mutual information measure, although the latter is to be preferred when measuring dependence (see Goodman and Kruskal for a discussion on this point[37]).
Stiles was very clear about the usefulness of using associations between index terms, he saw that through them one was 'able to locate documents relevant to a request *even though the document had not been indexed by the term used in the request*'[36].

The model in this chapter also connects with two other ideas in earlier research. One is the idea of inverse document frequency weighting already discussed in Chapter 2. The other is the idea of term clustering. Taking the weighting idea first, this in fact goes back to the early paper by Edmundson and Wyllys[38], we can write

or in words, for any document the probability of relevance is inversely proportional the probability with which it will occur on a random basis.
If the *P*(document) is assumed to be the product of the probabilities of the individual index terms being either present or absent in the document then after taking logs we have the inverse document frequency weighting principle.
It assumes that the likelihood *P*(document/relevance) is constant for all documents.
Why it is exactly that this principle works so well is not yet clear (but see Yu and Salton's recent theoretical paper[39]).

The connection with term clustering was already made earlier on in the chapter.
The spanning tree can be looked upon as a classification of the index terms.
One of the important consequences of the model described in this chapter is that it lays down precisely how the tree should be used in retrieval.
Earlier work in this area was rather *ad hoc* and did not lead to conclusive results[40].

It should be clear now that the quantitative model embodies within one theory such diverse topics as term clustering, early association analysis, document frequency weighting, and relevance weighting.

Previous Chapter: Search Strategies

**
References**

**
**

1. ROBERTSON, S.E. and SPARCK JONES, K., 'Relevance weighting of search terms', *Journal of the American Society for Information Science*, **27**, 129-146 (1976)

2. van RIJSBERGEN, C.J., 'A theoretical basis for the use of co-occurrence data in information retrieval',* Journal of Documentation*, **33**, 106-119 (1977).

3. BOOKSTEIN, A. and KRAFT, D., 'Operations research applied to document indexing and retrieval decisions', *Journal of the ACM*, **24**, 410-427 (1977).

4. MARON, M.E., 'Mechanized documentation: The logic behind a probabilistic interpretation', In: *Statistical Association Methods for Mechanized Documentation* (Edited by Stevens *et al.*) National Bureau of Standards, Washington, 9-13 (1965).

5. OSBORNE, M.L., 'A Modification of Veto Logic for a Committee of Threshold Logic Units and the Use of 2-class Classifiers for Function Estimation', Ph.D. Thesis, Oregon State University (1975).

6. GOOD, I.J., *Probability and the Weighting of Evidence*, Charles Griffin and Co.Ltd., London (1950).

7. ROBERTSON, S.E., 'The probability ranking principle in IR', *Journal of Documentation*, **33**, 294-304 (1977).

8. GOFFMAN, W., 'A searching procedure for information retrieval', *Information Storage and Retrieval*, **2**, 294-304 (1977).

9. WILLIAMS, J.H., 'Results of classifying documents with multiple discriminant functions', In : *Statistical Association Methods for Mechanized Documentation *(Edited by Stevens et al.) National Bureau of Standards, Washington, 217-224 (1965).

10. DE FINETTI, B.,* Theory of Probability*, **Vol. 1**, 146-161, Wiley, London (1974).

11. KU, H.H. and KULLBACK, S., 'Approximating discrete probability distributions',* IEEE Transactions on Information Theory,* **IT-15**, 444-447 (1969).

12. KULLBACK, S., *Information Theory and Statistics*, Dover, New York (1968).

13. CHOW, C.K. and LIU, C.N., 'Approximating discrete probability distributions with dependence trees', *IEEE Transactions on Information Theory,* **IT-14**, 462-467 (1968).

14. COX, D.R., 'The analysis of multivariate binary data', *Applied Statistics*, **21**, 113-120 (1972).

15. BOX, G.E.P. and TIAO, G.C., *Bayesian Inference in Statistical Analysis*, 304-315, Addison-Wesley, Reading, Mass. (1973).

16. GOOD, I.J., *The Estimation of Probabilities: An Essay on Modern Bayesian Methods*, The M.I.T. Press, Cambridge, Mass. (1965).

17. HARPER, D. and van RIJSBERGEN, C.J., 'An evaluation of feedback in document retrieval using co-occurrence data', *Journal of Documentation*, **34**, 189-216 (1978).

18. HUGHES, G.F., 'On the mean accuracy of statistical pattern recognizers', *IEEE Transactions on Information Theory*, **IT-14**, 55-63 (1968).

19. MARON, M.E. and KUHNS, J.L., 'On relevance, probabilistic indexing and information retrieval', *Journal of the ACM*, **7**, 216-244 (1960).

20. IVIE, E.L., 'Search Procedures Based on Measures of Relatedness Between Documents', Ph.D. Thesis, M.I.T., Report MAC-TR-29 (1966).

21. WHITNEY, V.K.M., 'Minimal spanning tree, Algorithm 422', *Communications of the ACM*, **15**, 273-274 (1972).

22. BENTLEY, J.L. and FRIEDMAN, J.H., *Fast Algorithm for Constructing Minimal Spanning Trees in Coordinate Spaces*, Stanford Report, STAN-CS-75-529 (1975).

23. CROFT, W.B., 'Clustering large files of documents using single link', *Journal of the American Society for Information Science*, **28**, 341-344 (1977).

24. SALTON, G., *Dynamic Information and Library Processing,* Prentice-Hall, Englewoods Cliffs, NJ., 441-445 (1975).

25. JARDINE, N. and SIBSON, R., *Mathematical Taxonomy,* pp. 12-15, Wiley, London and New York (1971).

26. KENDALL, M.G. and STUART, A., *Advanced Theory of Statistics,* **Vol. 2**, 2nd ed., Griffin, London (1967).

27. LINFOOT, E.H., 'An informational measure of correlation', *Information and Control*, **1**, 85-89 (1957).

28. YU, C.T. and SALTON, G., "Precision Weighting - An effective automatic indexing method', *Journal of the ACM*, **23**, 76-85 (1976).

29. YU, C.T., LUK, W.S. and CHEUNG, T.Y., 'A statistical model for relevance feedback in information retrieval', *Journal of the ACM*, **23**, 273-286 (1976).

30. NEGOITA, C.V., 'On the decision process in information retrieval', *Studii si cercetari de documentare*, **15**, 269-281 (1973).

31. ROBERTSON, S.E., 'Theories and models in information retrieval', *Journal of Documentation*, **33**, 126-148 (1977).

32. DOYLE, L.B., *Information Retrieval and Processing*, Melville Publishing Co., Los Angeles, California (1975).

33. DOYLE, L.B., 'Programmed interpretation of text as a basis for information retrieval systems', In: *Proceedings of the Western Joint Computer Conference*, San Francisco, 60-63 (1959).

34. DOYLE, L.B., 'Semantic road maps for literature searchers', *Journal of the ACM,* **8**, 553-578 (1961).

35. DOYLE, L.B., 'Some compromises between word grouping and document grouping', In: *Statistical Association Methods for Mechanized Documentation* (Edited by Stevens *et al*.) National Bureau of Standards, Washington, 15-24 (1965).

36. STILES, H.F., 'The association factor in information retrieval', *Journal of the ACM*, **8**, 271-279 (1961).

37. GOODMAN, L. and KRUSKAL, W., 'Measures of association for cross-classifications', *Journal of the American Statistical Association*, **49**, 732-764 (1954).

38. EDMUNDSON, H.P. and WYLLYS, R.E., 'Automatic abstracting and indexing - Survey and recommendations', *Communications of the ACM*, **4**, 226-234 (1961).

39. YU, C.T. and SALTON, G., 'Effective information retrieval using term accuracy', *Communications of the ACM*, **20**, 135-142 (1977).

40. SPARCK JONES, K., *Automatic Keyword Classification for Information Retrieval*, Butterworths, London (1971).

Previous Chapter: Search Strategies